



STATISTICAL CALCULATION 
FOR BEGINNERS 






BY 


E. G. CHAMBERS, M.A. 

Assistant Director of Research in Industrial
Psychology, Cambridge; Fellow of the
Royal Statistical Society



CAMBRIDGE 

AT THE UNIVERSITY PRESS 
1952 






PUBLISHED BY 

THE SYNDICS OF THE CAMBRIDGE UNIVERSITY PRESS

London Office: Bentley House, N.W. 1
American Branch: New York

Agents for Canada, India, and Pakistan: Macmillan 


First Edition 1940
Reprinted 1943
Reprinted 1945
Reprinted 1946
Reprinted 1948
Second Edition 1952


Printed in Great Britain at the University Press, Cambridge
(Brooke Crutchley, University Printer)



Preface to Second Edition 


In the ten years that have elapsed since this book was first 
published, certain additional statistical methods of particular 
value to research workers have been developed. These methods, 
presented as simply as possible, are included in this new 
edition. The whole of the original text has been revised and 
some small arithmetical errors corrected. I am grateful to 
those who have pointed out these mistakes. 

The original numbering of the chapters and sections has 
been preserved as far as possible. The chief additional material 
is as follows. An account of the nature and use of the binomial
distribution is given in Chapter iv. The calculation of t from
arbitrary origins, of the variance ratio and of the standard
error of the difference between proportions is included in
Chapter v. In Chapter vii the Kendall coefficient of rank
correlation is introduced and in Chapter ix a method of fitting
linear and logarithmic curves to observational data is explained.
Chapters x and xi are entirely new, the latter giving
a short introduction to the method of analysis of variance. 
There are also some hitherto unpublished tables in the 
Appendices F-K. Exercises on the new material are provided, 
with answers. I am especially grateful to Mr J. W. Whitfield 
and Miss V. R. Cane for helpful comments and criticisms. 


Cambridge 
June 1950 


E.G.C. 



Preface to First Edition 


The purpose of this book is to explain as simply as possible 
how to perform the calculations involved in the commoner 
statistical methods. It is not in any sense a treatise on the 
theory of statistics, only sufficient theory being given to 
enable the student to understand the use and application of 
the methods described. 

No assumption of mathematical ability on the part of the 
reader is made; the calculations described involve the use of 
arithmetic only. A worked example of each method given is 
provided and abundant exercises with answers are supplied. 

Whilst the book is chiefly addressed to students of the biological
sciences, especially Psychology, the methods described
are fundamental to statistical work and should, it is hoped,
prove useful to anyone who has to make use of elementary
statistical methods.

I should like to express my sincere gratitude to Dr J. O.
Irwin, who very kindly read the whole of the manuscript and
made many extremely helpful criticisms, and to Dr J. Wishart,
who made some very valuable suggestions when the manuscript
was approaching its final form. I wish to acknowledge
also the kindness of Professor R. A. Fisher and his publishers,
Messrs Oliver and Boyd, in allowing me to print extracts from
various statistical tables given in his book Statistical Methods
for Research Workers, and in the statistical tables by Fisher
and Yates, also published by Oliver and Boyd. Full references
to these two works are made in the text.

E. G. CHAMBERS 

Cambridge 

August 1940 




Contents


Chapter I. INTRODUCTION
    Description of certain statistical terms
    Notation
    References
    Use and abuse of statistical methods

Chapter II. AVERAGES
    The arithmetic mean
    The median
    The mode

Chapter III. SCATTER OR DISPERSION
    The range
    The inter-quartile range
    The mean deviation
    The standard deviation
    The coefficient of variation

Chapter IV. THE NORMAL AND BINOMIAL DISTRIBUTIONS
    Testing the normality of a distribution
    The binomial distribution and the calculation of probabilities

Chapter V. SIGNIFICANCE OF MEAN AND DIFFERENCE BETWEEN MEANS
    Significance of a single mean
    Significance of the difference between means: large samples
        and small samples
    Variance differences: the variance ratio
    Significance of the difference between proportions

Chapter VI. CORRELATION
    Product-moment correlation when N is small
    Product-moment correlation when N is large
    The diagonal summation method
    Significance of product-moment correlation: standard error of r
    Significance of the difference between two correlations
    Mean of several values of r
    Partial correlation
    Significance of a partial correlation coefficient

Chapter VII. OTHER METHODS OF CORRELATION
    The ranking method: (a) Spearman’s coefficient
                        (b) Kendall’s coefficient
    Significance of τ
    Partial rank correlation
    Biserial correlation
    Fourfold correlation

Chapter VIII. REGRESSION AND THE CORRELATION RATIO
    Graphic construction of regression lines
    Testing the linearity of regressions
    The correlation ratio

Chapter IX. CONTINGENCY: GOODNESS OF FIT
    Contingency
    2 × 2 tables
    Yates’s correction
    Alternative method of calculation
    2 × n tables
    Goodness of fit
    Curve fitting
    Logarithmic curves
    Polynomials

Chapter X. RANKING AND THE AGREEMENT OF JUDGES
    Paired comparisons: the coefficient of consistence
    Significance of ζ
    Tied preferences
    The case of m judges: coefficient of agreement
    Significance of u
    Combination of several rankings
    Coefficient of concordance
    Significance of W

Chapter XI. INTRODUCTION TO ANALYSIS OF VARIANCE
    Fundamental nature
    One-way classification
    Two-way classification
    Three-way classification
    n-way classification

Answers to Exercises

Appendix A. Table of $\sqrt{n_1 n_2 (n_1 + n_2 - 2)/(n_1 + n_2)}$

Appendix B. Curve showing values of N necessary for different
    values of r to be significant

Appendix C. Table of values of $6/N(N^2 - 1)$ for different values
    of N

Appendix D. Squares of numbers from 1 to 99

Appendix E. Numerical material for exercises

Appendix F. Values of $n(n - 1)(2n + 5)$ for different values of n

Appendix G. Values of 2) for different values of i

Appendix H. Values of $mn(m - 1)(n - 1)/4$ for different values
    of n and m

Appendix I. Values of ν for different values of n and m

Appendix J. Values of $\tfrac{1}{2}\binom{n}{2}\binom{m}{2}\frac{m-3}{m-2}$ for different values of
    n and m

Appendix K. Values of $m^2(n^3 - n)/12$ for different values of n
    and m

Index

Chapter I 

INTRODUCTION 

1.i. Description of certain statistical terms. The
statistical methods described in this book are all concerned
with the treatment of variables. By a variable is meant a
quantity which assumes different values that may be measured
in some appropriate unit. Height, weight, test scores, readings
on a thermometer, etc., are examples of variables, or variates,
as they are sometimes called. Variables are usually denoted by
X or Y in this book. The number of times a particular value of
a variable occurs in a set of observations is called the frequency
of occurrence of that value, and a table showing the frequency
of occurrence of all the values of a variable in a set of observations
is named a frequency distribution table.

A series of observations may be represented by one value 
which is called an average, and the way in which the different 
values of the variable lie about this average is described as the 
scatter or dispersion of the observations. Measures of averages 
and scatter are descriptive statistics, since they yield in a condensed
form a description of a whole series of observations.
This is the first function of statistical method: the other chief 
function is the examination of various hypotheses which are 
made about observational data. 

It is usually impossible to measure all the values of any
variable, so that the data from a single experiment are only a
sample drawn from the total population of possible observations.
For example, if the variable is human height, then the
total population of that variable would be the height of every
man, woman and child ever on earth; it is manifestly impossible
to measure all these values and in practice we have to
content ourselves with measuring the heights of a sample of
some convenient size. The distribution of the total population
can usually be expressed in a mathematical form by using a
small number of constants or parameters. Obviously we can
never know the exact values of these parameters, since we
cannot measure the whole population, but we can make
estimates of them by measuring random samples, that is,
samples drawn purely at random from the population. These
estimates are known as statistics, and their accuracy as estimates
depends on the size of the sample and the type of
distribution of the variable.

In calculating some statistics it is essential to know the
number of degrees of freedom available for the calculation. The
conception of ‘degrees of freedom’ is not an easy one for the
beginner, so that whenever the term is used in this book categorical
rules for determining the number of degrees of freedom
are supplied. Consideration of the following example may give
some idea of the meaning of the term. Suppose 100 shillings
are to be shared amongst 10 boys. We may give as many as we
like (up to a total of 99) to each of 9 of the boys, but we are
bound to give the tenth boy what is left over, i.e. we have only
9 degrees of freedom in sharing out the shillings. If we were
told further that 60 shillings were to be shared amongst the
oldest 5 boys and the remainder amongst the youngest 5, we
should only have 8 degrees of freedom for doing this, since the
fifth boy in each group would have to have what was left over;
we should not be free to vary his share.
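The sharing example can be sketched in a few lines of Python. This is only an illustration under assumed figures: the nine shares in `free` are made-up values, and `last_share` is a hypothetical helper name, not anything taken from the text.

```python
# Illustration of "degrees of freedom": 100 shillings shared amongst 10 boys.
TOTAL = 100

def last_share(free_shares, total=TOTAL):
    """Return the share forced on the final boy once the others are fixed."""
    remainder = total - sum(free_shares)
    if remainder < 0:
        raise ValueError("the free shares already exceed the total")
    return remainder

# Nine shares chosen freely (hypothetical values).
free = [12, 8, 10, 15, 5, 9, 11, 10, 7]

forced = last_share(free)          # the tenth boy's share is not a free choice
degrees_of_freedom = len(free)     # only nine shares could be varied

print(forced, degrees_of_freedom)
```

Changing any entry of `free` changes the forced share, but the number of free choices stays at nine.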

Two variables which are related together, so that a knowledge
of the values of one variable indicates likely values of
the other, are said to be associated or correlated. If the variables
are unrelated, they are said to be independent.

Other terms, applicable to particular methods, will be 
described in their appropriate places. 

1.ii. Notation. A certain amount of symbolism is essential
in the description of statistical methods. Unfortunately
there is a lack of agreement amongst different authors, which
is apt to be confusing to the beginner. For this reason an
attempt at a consistent method of notation is made in this
book. Being based on first principles it is hoped that it will
be readily understood by the learner and will enable him to
follow the notation used in the standard books on statistical
methods. Symbols in general established use are taken over
unchanged. Here again there is confusion, since owing to the
multiplicity of statistics the same symbol may have to stand
for quite different quantities. Thus the symbol z is used
for three different quantities and particular care is needed
on the part of the student to avoid misconception in such
cases.

Certain symbols have the same significance throughout the
whole of this book. For example, N always stands for the total
number of observations in a sample and S always signifies
‘the sum of’. The other notation is made as unambiguous as
possible. Certain Greek letters are used as symbols: a list of
these with their pronunciation is given below:

    β (beta)       π (pi)
    γ (gamma)      ρ (rho)
    δ (delta)      σ (sigma)
    ζ (zeta)       τ (tau)
    η (eta)        φ (phi)
    μ (mu)         ν (nu)
    χ (chi)

Σ (capital sigma) is used in some places to indicate ‘the sum
of’. As a rule S is used for summation of sample values and Σ
for derived quantities.

Care must be taken by the student to avoid confusing
suffixes and indices. Suffixes are small numbers or letters
written after a symbol at the foot, e.g. $x_1$, etc.; these are
merely descriptive and confine the use of the symbol to a particular
purpose. Indices are small numbers written after and
above symbols and have their usual algebraical significance;
for example, $x^2$ (x squared) means x multiplied by x, $y^3$ (y
cubed) means y multiplied by y multiplied by y, and so on.

The usual arithmetical symbols, +, −, × and ÷, have their
accustomed significance. There are three other symbols with
which the non-mathematical student may not be familiar.
Vertical lines drawn on each side of a quantity mean ‘the
positive numerical value of’, e.g. |a − b| means ‘the positive
numerical value of the difference between a and b’. Using this
notation, therefore, it does not matter whether we write |a − b|
or |b − a|. Secondly, there is the factorial sign, ‘!’. This latter
is best explained by examples, e.g. 4! stands for 4 × 3 × 2 × 1,
6! for 6 × 5 × 4 × 3 × 2 × 1, and so on. Thirdly, $\binom{n}{p}$ means the
number of combinations of n things taken p at a time. Expanded
algebraically,

$$\binom{n}{p} = \frac{n!}{p!\,(n-p)!}.$$
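The three symbols just described (the positive numerical value, the factorial, and the number of combinations of n things taken p at a time) have direct counterparts in Python's standard library. The sketch below is illustrative only; the helper name `combinations` is an assumption, written out from the usual factorial expansion so that it can be compared with the built-in `math.comb`.

```python
import math

def combinations(n, p):
    """Number of combinations of n things taken p at a time: n! / (p! (n - p)!)."""
    return math.factorial(n) // (math.factorial(p) * math.factorial(n - p))

a, b = 3, 7
same_either_way = abs(a - b) == abs(b - a)   # |a - b| equals |b - a|

four_factorial = math.factorial(4)           # 4 x 3 x 2 x 1
six_factorial = math.factorial(6)            # 6 x 5 x 4 x 3 x 2 x 1

pairs_from_five = combinations(5, 2)         # agrees with math.comb(5, 2)

print(same_either_way, four_factorial, six_factorial, pairs_from_five)
```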

Since a good deal of arithmetical work is involved in certain 
of the statistical methods described in the following chapters, 
it is an advantage for the student to be familiar with the use of 
logarithms (unless he has a calculating machine available). 


1.iii. References. Reference is made in the following
pages to two invaluable books on statistics and to certain
books of statistical tables. The references are made numerically
to the following works:

(1) An Introduction to the Theory of Statistics. G. Udny Yule
and M. G. Kendall. 11th edition. Charles Griffin and Co.
Ltd. 1937.

(2) Statistical Methods for Research Workers. R. A. Fisher.
9th edition. Oliver and Boyd. 1944.

(3) Tables for Statisticians and Biometricians, Part I. Edited
by Karl Pearson. Biometrika Office, University College,
London.

(4) Barlow's Tables of Squares, Cubes, Square-roots, Cube-roots
and Reciprocals of all Integral Numbers up to 12,500.
E. and F. N. Spon. 4th edition. 1941.

(5) Statistical Tables for Biological, Agricultural and Medical
Research. Fisher and Yates. Oliver and Boyd. 2nd edition.
1942.

(6) The Advanced Theory of Statistics. M. G. Kendall. Charles
Griffin and Co. Ltd. Vol. i, 1945; vol. ii, 1946.

Since no attempt is made in this book to prove or justify the
various methods and formulae used, the student wishing to go
into such matters is referred to the first two and the last of
the foregoing works.



1.iv. Use and abuse of statistical methods. The
student who works conscientiously through the following
chapters should learn how to make use of the commoner
methods of statistics. He should never forget, however, that
statistical methods are merely tools for a research worker.
They enable him to describe, relate and assess the value of his
observations. They cannot make amends for incorrect observation
nor can they of themselves provide a single fact of
psychology, biology or any other subject of research. Statistical
methods are to the research worker what tools are to a
carpenter. The latter has first to learn how to use his tools and
he may then by employing them reveal the useful and beautiful
purposes to which his material may be put. But the tools
themselves must be used for their correct functions. The
craftsman will not, for instance, use a mallet and chisel or a
fretsaw to plane a plank of wood, nor will he use a hammer to
drive in a screw. In the same way statistical methods must
only be used by the research worker for the purposes for which
they have been devised.

Further, a carpenter’s tools cannot tell him directly anything
about the materials he is using. They cannot by themselves
distinguish between mahogany and deal nor prove that
oak is more durable than white wood. No carpenter’s tools
have ever yet made a piece of wood; similarly no statistical
method has ever yet produced a biological fact.

The student is advised, therefore, to try to acquire an understanding
of the specific purpose of each statistical method he
learns to use, to appreciate the scope of and the assumptions
underlying the use of each formula, and to realise that the outcome
of each calculation is a statistical statement which has
to be interpreted in terms of the particular branch of science
from which the data for examination are drawn.



Chapter II

AVERAGES 


2.i. The arithmetic mean. The best known and most
useful form of average is the arithmetic mean, usually referred
to as the ‘mean’ or the ‘average’. It is easily calculated by
adding together all the observations to be averaged and
dividing the sum or total by the number of observations.

Example 1. Find the mean of the following observations:
22, 24, 20, 23, 21, 19, 23, 22, 20, 22, 20, 22, 23, 25, 21, 21, 22,
24, 23, 22, 23, 21, 22, 21, 23.

Add together all the observations.

The sum = 549,
The number of observations = 25,

$$\text{The arithmetic mean} = \frac{\text{sum of observations}}{\text{no. of observations}} = \frac{549}{25} = 21.96.$$


This procedure may be generalised to cover all cases. If X
is a variable which has different values $X_1$, $X_2$, $X_3$, etc., then
the arithmetic mean of a number N of such values is the sum
of the various values of X, which we denote by S(X), divided
by N, the number of them. In general, therefore,

$$m_x = \bar{X} = \frac{S(X)}{N}. \tag{1}$$

Here $m_x$ and $\bar{X}$ (called X-bar) are different ways of denoting
‘the mean of X’.
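Formula (1) can be checked directly on the 25 observations of Example 1. The sketch below is a minimal illustration; the function name `arithmetic_mean` is an assumption introduced here.

```python
# The 25 observations of Example 1.
observations = [22, 24, 20, 23, 21, 19, 23, 22, 20, 22, 20, 22, 23,
                25, 21, 21, 22, 24, 23, 22, 23, 21, 22, 21, 23]

def arithmetic_mean(values):
    """Formula (1): the sum S(X) divided by the number of observations N."""
    return sum(values) / len(values)

total = sum(observations)        # S(X)
n = len(observations)            # N
mean = arithmetic_mean(observations)

print(total, n, round(mean, 2))
```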


2.ii. If N is large and no adding machine is available, the
process of addition may be very laborious. It may, however,
be made easier by the construction of a frequency distribution
table. This is a table showing how often each value of
the variable occurs in the sample under consideration. In
Example 1, the values taken by X all lie between 19 and 25
inclusive. If we count how many times each different value of
X occurs and write the totals in tabular form, we obtain the
frequency distribution table given below.

Table I

     X     f
    19     1
    20     3
    21     5
    22     7
    23     6
    24     2
    25     1
          --
          25


In this table the first column, headed X, shows the different
values assumed by the variable X in the sample, and the second
column, headed f, gives the number of times, or frequency, of
occurrence of each. The total of the f column is, of course, the
total number of observations we are averaging, i.e. S(f) = N.
The next step is to write down a third column, headed fX,
which is produced by multiplying together the corresponding
pairs of numbers in the X and f columns. We then sum the
fX column, giving us S(fX), and the arithmetic mean is then
obtained by dividing this sum by N as before; i.e.

$$m_x = \bar{X} = \frac{S(fX)}{N}. \tag{2}$$


Example 2. Calculate the mean of the observations in
Example 1 by constructing a frequency distribution table.

     X     f     fX
    19     1     19
    20     3     60
    21     5    105
    22     7    154
    23     6    138
    24     2     48
    25     1     25
          --    ---
          25    549

S(fX) = 549,
N = 25,

$$m_x = \frac{549}{25} = 21.96.$$

It will be noted that this result is identical with that
obtained in Example 1.
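The frequency-table route of formula (2) can be sketched with Python's `collections.Counter`, which does the counting that the f column records. The variable names are illustrative assumptions.

```python
from collections import Counter

# The 25 observations of Example 1.
observations = [22, 24, 20, 23, 21, 19, 23, 22, 20, 22, 20, 22, 23,
                25, 21, 21, 22, 24, 23, 22, 23, 21, 22, 21, 23]

# Count how often each value of X occurs: this is the f column.
frequency = Counter(observations)

# One row per distinct value: (X, f, fX).
rows = [(x, f, f * x) for x, f in sorted(frequency.items())]

n = sum(f for _, f, _ in rows)          # S(f) = N
sum_fx = sum(fx for _, _, fx in rows)   # S(fX)
mean = sum_fx / n                       # formula (2)

for x, f, fx in rows:
    print(f"{x:>3} {f:>3} {fx:>5}")
print(n, sum_fx, round(mean, 2))
```

The result agrees with the direct summation of formula (1), since multiplying each value by its frequency merely regroups the same additions.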



2.iii. The method of Section 2.ii is useful when the range
of the X values is small, but if there are many different values
of X the method again becomes laborious. Suppose the observations
in Example 1 were the lengths of 25 sticks measured
in centimetres, each one being measured to the nearest centimetre.
Now let us suppose we had a large number of such
sticks and measured them in millimetres. The range of the
measurements might now be from 190 to 252 mm., so that if
we constructed a frequency distribution table of these we
should have 63 different values of X to tabulate. This would
be tedious, and the calculation may be shortened, with some
small sacrifice of accuracy, by subdividing the range of the
X’s into a convenient number of groups. In practice, a number
of groups between 12 and 20 should be chosen, and the best
unit for grouping may be found by dividing the range first by
12 and then by 20, and taking a convenient unit in between
these two quotients. For instance, if the range is 63, then the
results of dividing 63 first by 12 and then by 20 are 5.25 and
3.15. Hence a convenient working unit for grouping would be
4 in this case. This means that we should group the values of
X together in 4’s, so that the first group would comprise 190,
191, 192 and 193, the second 194, 195, 196 and 197, and so on,
the last group being 250, 251, 252 and 253. We could now construct
a frequency distribution table of these groups. Such a
table might be, for example, as Table II. This gives the frequency
distribution of the lengths of 222 sticks measured in
4 mm. groups.

Table II

       X         f
    190-193      2
    194-197      4
    198-201      7
    202-205     12
    206-209     19
    210-213     24
    214-217     27
    218-221     35
    222-225     26
    226-229     21
    230-233     18
    234-237     13
    238-241      6
    242-245      5
    246-249      2
    250-253      1
               ---
               222
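The rule for choosing a working unit can be sketched as follows. The code treats the range as the 63 distinct millimetre values from 190 to 252, as the text does; the function names `unit_bounds` and `group_limits` are hypothetical, introduced only for this illustration.

```python
import math

def unit_bounds(lowest, highest, min_groups=12, max_groups=20):
    """Quotients of the range by 20 and by 12; a convenient unit lies between them."""
    span = highest - lowest + 1          # number of distinct values, 63 here
    return span / max_groups, span / min_groups

def group_limits(lowest, highest, unit):
    """Recorded limits of each group, starting the first group at `lowest`."""
    groups = []
    start = lowest
    while start <= highest:
        groups.append((start, start + unit - 1))
        start += unit
    return groups

low, high = unit_bounds(190, 252)
candidates = list(range(math.ceil(low), math.floor(high) + 1))

groups = group_limits(190, 252, 4)       # the unit of 4 chosen in the text

print(low, high, candidates)
print(len(groups), groups[0], groups[-1])
```

Either whole-number candidate gives a number of groups between 12 and 20; the text settles on 4, which yields sixteen groups.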

Now the method of calculating the mean length of the sticks
from such a table depends on the assumption that the average
length of the sticks in each group is equal to the mean value of
X for that group. For instance, there are 12 sticks in the 202-205
group and we shall assume that the average length of those
12 sticks is equal to the average of 202, 203, 204 and 205, i.e.
203.5. This is, of course, an assumption, but the larger the
number of readings in each group the nearer it becomes to
being true.

Care should be taken in arranging the grouping that this
assumption should be as nearly as possible true for the end
groups, leaving the middle ones, which usually have more
readings in them, to look after themselves. For instance, if
the two readings in the 190-193 group were both 190, then
their average would be 190 instead of 191.5, which is assumed
by the grouping. In such a case it would have been better to
have started the grouping from 189, which would assume an
average for the group of 190.5.

In order to calculate the mean of the observations in a
grouped frequency distribution table, such as Table II, we
take an arbitrary origin, or starting point, and then calculate
the discrepancy between this point and the true mean. Let us
take our arbitrary origin near the middle of the range, since
this simplifies the arithmetic. It is convenient to have it at the
centre of a group so we will choose the 218-221 group, so that
our arbitrary origin will be 219.5, the centre of the group. This
group we number 0. The next group in the table, 222-225, has
an average of 223.5, which is one group unit, or working unit,
above the arbitrary origin, and we therefore number this
group 1. In a similar manner the 226-229 group is numbered 2,
and so on.

The 214-217 group averages 215.5, which is one working
unit less than the arbitrary origin, so this group is numbered
−1. Similarly the 210-213 group becomes −2, and so on.

We can now replace the X column by another column, which
we will head x, which indicates the number of working units
that each X-group lies away from the arbitrary origin. The X
column is now neglected and a fourth column, headed fx, is
written down. This is obtained by multiplying corresponding
entries in the f and x columns. By adding this column we get
S(fx), and the discrepancy, D, between the true mean and the
arbitrary origin, in working units, is given by

$$D = \frac{S(fx)}{N}. \tag{3}$$

This quantity D tells us how many working units the true
mean $m_x$ lies away from the arbitrary origin, which we will
call A. Hence the true mean is given by the formula

$$m_x = A + Dw, \tag{4}$$

where w is the size of the working unit. If D turns out to be
negative, then Dw will have to be subtracted from A. If it
should happen that the arbitrary origin is the true mean, as
might happen with a perfectly symmetrical distribution, then
D will, of course, be zero.

The whole process of calculating the mean by this method is
shown in Example 3.

Example 3. Calculate the mean of the data in Table II.

       X         f      x       fx
    190-193      2     -7      -14
    194-197      4     -6      -24
    198-201      7     -5      -35
    202-205     12     -4      -48
    206-209     19     -3      -57
    210-213     24     -2      -48
    214-217     27     -1      -27
    218-221     35      0    (-253)*
    222-225     26      1       26
    226-229     21      2       42
    230-233     18      3       54
    234-237     13      4       52
    238-241      6      5       30
    242-245      5      6       30
    246-249      2      7       14
    250-253      1      8        8
               ---             ---
               222             256
                              -253
                               ---
                                 3

* Since there will be no entry in the fx column corresponding to
x = 0, this is a convenient place to add the negative entries in the fx
column.

S(fx) = 3 and N = 222, so that D = 3/222 = 0.0135 working units, and

$$m_x = 219.5 + 0.0135 \times 4 = 219.55.$$

We conclude therefore that the mean length of the sticks
was 220 mm., to the nearest millimetre.
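The arbitrary-origin calculation can be checked with a short sketch. It assumes the sixteen group frequencies of Table II (summing to 222), a working unit of 4 and an origin of 219.5, the centre of the 218-221 group; the variable names are illustrative only.

```python
# Table II: (lower limit, upper limit, frequency) for each 4 mm. group.
table = [(190, 193, 2), (194, 197, 4), (198, 201, 7), (202, 205, 12),
         (206, 209, 19), (210, 213, 24), (214, 217, 27), (218, 221, 35),
         (222, 225, 26), (226, 229, 21), (230, 233, 18), (234, 237, 13),
         (238, 241, 6), (242, 245, 5), (246, 249, 2), (250, 253, 1)]

unit = 4                     # size of the working unit, w
origin_index = 7             # the 218-221 group is numbered 0
lower, upper, _ = table[origin_index]
origin = (lower + upper) / 2          # arbitrary origin, 219.5

n = sum(f for _, _, f in table)       # N = S(f)

# x is the number of working units each group lies from the origin.
sum_fx = sum(f * (i - origin_index) for i, (_, _, f) in enumerate(table))

discrepancy = sum_fx / n              # formula (3), in working units
mean = origin + discrepancy * unit    # formula (4)

# Cross-check: the same mean from the group centres used directly.
direct = sum(f * (lo + hi) / 2 for lo, hi, f in table) / n

print(n, sum_fx, round(mean, 2), round(direct, 2))
```

The cross-check shows that the arbitrary origin changes only the arithmetic, not the answer: both routes rest on the same assumption that each group averages its centre value.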

2.iv. The median. Another form of average which is
sometimes used for convenience is the median. This, as its
name implies, is the middle observation and it is easily found
in ungrouped data by ranking the sample of observations in
order of their size and finding the central observation, if N
is odd, or the mean of the two observations in the middle, if
N is even. If N is odd, the median will be the (N + 1)/2th
observation. If N is even, it will be the mean of the N/2th and
the (N/2 + 1)th observations. For example, the median of 2, 4, 6,
8, 10, 12 is the mean of the 3rd and 4th readings, i.e.

(6 + 8)/2 = 7.

The median is representative of a set of observations in the
sense that there are exactly as many observations greater than
it as there are less. If the distribution is perfectly symmetrical,
the median is equal to the mean.

Example 4. Find the median of the observations in Example 1.

Ranking the observations in order of size we get: 19, 20, 20,
20, 21, 21, 21, 21, 21, 22, 22, 22, 22, 22, 22, 22, 23, 23, 23, 23,
23, 23, 24, 24, 25.

Since there are 25 observations, the median will be the 13th,
i.e. 22.
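The two rules for ungrouped data (odd N and even N) can be sketched as below. The function name `median` is an assumption for illustration; Python's standard `statistics.median` follows the same rules.

```python
def median(values):
    """Middle observation if N is odd, mean of the two middle ones if N is even."""
    ranked = sorted(values)
    n = len(ranked)
    if n == 0:
        raise ValueError("no observations")
    middle = n // 2
    if n % 2 == 1:
        return ranked[middle]                        # the (N + 1)/2th observation
    return (ranked[middle - 1] + ranked[middle]) / 2 # mean of N/2th and (N/2 + 1)th

# The 25 observations of Example 1.
observations = [22, 24, 20, 23, 21, 19, 23, 22, 20, 22, 20, 22, 23,
                25, 21, 21, 22, 24, 23, 22, 23, 21, 22, 21, 23]

even_case = median([2, 4, 6, 8, 10, 12])   # mean of the 3rd and 4th readings
odd_case = median(observations)            # the 13th of 25 ranked observations

print(even_case, odd_case)
```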

2.v. Finding the median for grouped data involves a little
approximation, as was the case in finding the mean of grouped
data. We assume that the values in each group are evenly
distributed through the group and the approximate value of
the median may be found by linear interpolation. Suppose, for
example, we wished to find the median of the data in Table II.
We have to find the value of the observation below which half
the observations lie. N in this case is 222, so that N/2 is 111.
The median therefore will be the position of the mean of the
111th and 112th observations. By adding together the first
seven entries in the f column we find that 95 of the observations
lie below 217.5,* and there are 35 observations in the 218-221
group. Obviously, therefore, the median lies somewhere in this
group. The position may be found by simple proportion.
111 − 95 = 16. In the 218-221 group 16 observations will
therefore be less and 19 greater than the median, assuming that
all the values in this group are evenly distributed. The group
contains 4 units, so that the position of the median is given by
adding 16/35 × 4 to 217.5, the limit of the previous group.
Hence the median of the data in Table II is

$$217.5 + \frac{16}{35} \times 4 = 217.5 + 1.83 = 219.33.$$

This may be checked by working from the other end of the
table. We find that 92 observations lie above 221.5, so that the
median is

$$221.5 - \frac{111 - 92}{35} \times 4 = 221.5 - 2.17 = 219.33.$$
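The interpolation for grouped data can be sketched as follows. The code assumes the Table II frequencies, measurements made to the nearest millimetre (so that the real lower limit of a group is half a unit below its recorded lower limit), and values spread evenly within each group; `grouped_median` is a hypothetical name.

```python
def grouped_median(groups, unit):
    """Median by linear interpolation.

    groups: (recorded lower limit, frequency) pairs in ascending order.
    unit:   width of each group in the units of measurement.
    """
    n = sum(f for _, f in groups)
    if n == 0:
        raise ValueError("no observations")
    half = n / 2
    below = 0                                  # observations below this group
    for lower, f in groups:
        if below + f >= half:
            real_lower = lower - 0.5           # assumes nearest-unit measurement
            return real_lower + (half - below) / f * unit
        below += f
    raise ValueError("frequencies do not reach N/2")

# Table II: recorded lower limit and frequency of each 4 mm. group.
table = [(190, 2), (194, 4), (198, 7), (202, 12), (206, 19), (210, 24),
         (214, 27), (218, 35), (222, 26), (226, 21), (230, 18), (234, 13),
         (238, 6), (242, 5), (246, 2), (250, 1)]

result = grouped_median(table, 4)
print(round(result, 2))
```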

2.vi. The mode. A form of average which is occasionally
used is the mode. This is the most frequently occurring, or most
fashionable, observation. For instance, in the data of Example
1, the mode would be 22, since this observation occurs more
often than any other. However, the mode cannot usually be
found as easily as this for small samples, since errors of
sampling may result in the frequent occurrence of some
observation remote from the true mode.

* Note that since the measurements in the X column are made to
the nearest millimetre, any observation up to but not equal to 217.5
will count as 217 or under, so that the real upper limit of the group is
217.5. Similarly the real lower limit of the 222-225 group is 221.5.

If a graph of the frequency distribution of any variable for
the total population were drawn, we should have a smooth
curve (see Fig. 2, p. 25 for an example). In such a curve the
position of the mode would be given by the highest point, since
there would be the maximum frequency at that point. Such
a curve might be symmetrical or it might be asymmetrical, or
skew. It has been found that in curves of moderate skewness
the position of the mode is given approximately by the relation

Mode = mean − 3 (mean − median).

It is better, therefore, to make use of this equation for finding
the mode of a sample than to rely on picking out the most
frequently occurring observation. (We are considering here
only distributions which give a hump-backed curve and have
therefore only one mode. Certain distributions may have two
or more modes, but the elementary student need not concern
himself with these.)

Example 5. Find the mode of the data in Table II.

We have found previously that the mean of the data is
219.55 and the median is 219.33. Substituting these values in
the equation

Mode = mean − 3 (mean − median),

we have

Mode = 219.55 − 3 (219.55 − 219.33)
     = 219.55 − 0.66
     = 218.89.
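Both ways of arriving at a mode can be sketched briefly: picking out the most frequent observation, and applying the approximate relation for moderately skew curves. The function name `estimated_mode` is an assumption; the mean and median used are the values 219.55 and 219.33 found for Table II.

```python
from collections import Counter

def estimated_mode(mean, median):
    """Approximate relation for moderately skew curves:
    mode = mean - 3 (mean - median)."""
    return mean - 3 * (mean - median)

# The 25 observations of Example 1.
observations = [22, 24, 20, 23, 21, 19, 23, 22, 20, 22, 20, 22, 23,
                25, 21, 21, 22, 24, 23, 22, 23, 21, 22, 21, 23]

# Most frequently occurring observation and how often it occurs.
most_frequent, count = Counter(observations).most_common(1)[0]

table_ii_mode = estimated_mode(219.55, 219.33)

print(most_frequent, count, round(table_ii_mode, 2))
```

When the mean and median coincide the relation returns the mean itself, which is the symmetrical case.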

2.vii. It may be seen from the above equation that if the
mean and median are equal the mode is also equal to the mean.
Thus in the case of a variable which is distributed in a perfectly
symmetrical manner, the mean, the median and the mode are
all equal.

Note on the construction of a frequency distribution table

If the observations to be tabulated are on cards, the frequency
distribution table is easily formed by sorting the cards
into their appropriate groups and counting the number of
cards in each group. When, however, the data are not on cards,
a frequency table may be made by the use of a spot diagram.
This is made by tabulating the X column and then going
through the observations one at a time and putting a spot
against the appropriate X-group for each observation. The
data in Table II would appear as under in a spot diagram.

Spot diagram of the data in Table II

       X                                                      f
    190-193   ••                                              2
    194-197   ••••                                            4
    198-201   ••••• ••                                        7
    202-205   ••••• ••••• ••                                 12
    206-209   ••••• ••••• ••••• ••••                         19
    210-213   ••••• ••••• ••••• ••••• ••••                   24
    214-217   ••••• ••••• ••••• ••••• ••••• ••               27
    218-221   ••••• ••••• ••••• ••••• ••••• ••••• •••••      35
    222-225   ••••• ••••• ••••• ••••• ••••• •                26
    226-229   ••••• ••••• ••••• ••••• •                      21
    230-233   ••••• ••••• ••••• •••                          18
    234-237   ••••• ••••• •••                                13
    238-241   ••••• •                                         6
    242-245   •••••                                           5
    246-249   ••                                              2
    250-253   •                                               1
                                                            ---
                                                            222

It helps in counting the spots for the entries in the f column if they
are put down in groups of five, as above.
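The spot-diagram procedure (go through the observations one at a time and mark the appropriate group) can be sketched as follows. The eleven lengths in `sample` are made-up values used only to exercise the code, and the function names `tally` and `spots` are hypothetical.

```python
def tally(observations, start, unit, n_groups):
    """Count observations falling in each group of width `unit`, first group at `start`."""
    counts = [0] * n_groups
    for value in observations:
        index = (value - start) // unit
        if not 0 <= index < n_groups:
            raise ValueError(f"{value} lies outside the tabulated groups")
        counts[index] += 1
    return counts

def spots(count):
    """Render a count as spots set down in groups of five."""
    full, rest = divmod(count, 5)
    return " ".join(["....."] * full + (["." * rest] if rest else []))

# Hypothetical stick lengths in millimetres (not data from the text).
sample = [190, 193, 194, 201, 203, 203, 204, 205, 205, 202, 204]

counts = tally(sample, start=190, unit=4, n_groups=4)

for i, f in enumerate(counts):
    lower = 190 + 4 * i
    print(f"{lower}-{lower + 3}  {spots(f):<12} {f}")
```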

EXERCISES ON CHAPTER II

1. Find the arithmetic mean of the scores in each of the tests in
Appendix E from A to G inclusive for

(a) subjects 1-25,

(b) subjects 26-50,

(c) subjects 51-75,

(d) subjects 76-100.

Use formula (1).

2. Find the arithmetic mean of the scores in test F for

(a) subjects 1-50,

(b) subjects 51-100.

Use formula (2).

3. Construct grouped frequency distribution tables for the scores
in each of the tests A to G inclusive for the whole 100 subjects. Use
the following group units and starting points.

    Test    Group unit    First group
     A          4            14-17
     B         15           122-136
     C          4             0-3
     D          4             4-7
     E          6            78-83
     F          2             8-9
     G          3            17-19

(a) Thence, using formulae (3) and (4), calculate the mean scores
in each of the tests A to G inclusive for the whole 100 subjects.

(b) Compare the means obtained with those given by adding the
means of the four groups of 25 subjects found in Exercise 1 and
dividing the totals by 4. These latter means are accurate and will
indicate the discrepancies introduced by grouping the data.

4. Find the median of the scores in each of the tests from A to G
inclusive for

(a) subjects 1-50,

(b) subjects 51-100.

5. From the answers to Exercise 1 compute the means of the scores
of each test from A to G inclusive for

(a) subjects 1-50,

(b) subjects 51-100.

Then, making use of the answers to Exercise 4, estimate the mode of
the scores in each test for

(a) subjects 1-50,

(b) subjects 51-100.

Use the relationship ‘Mode = mean − 3 (mean − median)’.



Chapter III 

SCATTER OR DISPERSION 

3.1. Whilst an average to some extent represents the whole 
series of observations of which it is the mean, yet it does not by 
itself convey sufficient information about those observations. 
As a general rule it is also necessary to know how the obser¬ 
vations are scattered around their average. Obviously the 
average of a given number of observations which all lie closely 
about the mean is more reliable as a representative statistic 
than one which is in the middle of a widely dispersed series of 
readings. It is important, therefore, to give an indication of 
the amount of scatter of the observations averaged. 

3.ii. The range. The simplest measure of scatter or dis¬ 
persion is the range of the observations, i.e. the distance be¬ 
tween the largest and smallest observations. The range, how¬ 
ever, is not a good measure of dispersion. It is based on two 
observations only, instead of on all the available information, 
and those two observations are liable to vary considerably in 
different samples, since the range depends essentially on the 
size of the sample. 

Some writers give the range which includes all the observations 
except the 10 % smallest and the 10 % largest in an 
attempt to allow for the variability of the extremes of a 
sample, but this practice has little to recommend it. 

3.iii. The inter-quartile range. A measure of dispersion 
which is not quite so subject to fluctuations of sampling is the 
inter-quartile range. This is obtained as follows. First rank all 
the observations in order of size and find the median (see 
section 2.iv). This divides the sample into two equal portions. 
Then find the median of each half: that of the lower half is 
called the lower quartile, that of the upper half the upper 
quartile. These two quartiles with the median divide the whole 
range of observations into four equal inter-quartile groups, 
and the distance between the lower and upper quartiles is 
called the inter-quartile range. It is usual to halve this and 
quote the ‘semi-inter-quartile range’. 

This measure of dispersion makes use of slightly more 
information than does the total range, but it does not lend itself 
to further mathematical treatment and is not a very valuable 
measure. 

3.iv. The mean deviation. A measure of scatter which 
makes use of all the observations is obtained by writing down 
the difference between each separate observation and the 
average, adding together all these differences without regard 
to their signs, and dividing the total by the number of obser¬ 
vations. This is called the mean deviation or mean variation and 
is best calculated from the median, as it is then a minimum. 

Example 6. Calculate the mean deviation of the data in 
Example 1. 

The median of the observations is 22, as was found in 
Example 4. The differences between each reading and the 
median, without regard to sign, are: 0, 2, 2, 1, 1, 3, 1, 0, 2, 0, 2, 
0, 1, 3, 1, 1, 0, 2, 1, 0, 1, 1, 0, 1, 1. 

The sum of these differences = 27, 

N = 25, 

The mean deviation = 27/25 = 1.08. 
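
The arithmetic of this example, and the statement that the mean deviation is least when taken about the median, can be checked with a short sketch. The list `abs_diffs` is copied from the example; the five-value `sample` is hypothetical and serves only to compare the two centres.

```python
from statistics import mean, median


def mean_deviation(values, centre):
    """Average absolute difference of the values from a chosen centre."""
    return sum(abs(v - centre) for v in values) / len(values)


# Absolute differences from the median, as listed in the example.
abs_diffs = [0, 2, 2, 1, 1, 3, 1, 0, 2, 0, 2,
             0, 1, 3, 1, 1, 0, 2, 1, 0, 1, 1, 0, 1, 1]
md_example = sum(abs_diffs) / len(abs_diffs)
print(sum(abs_diffs), len(abs_diffs), md_example)

# Hypothetical sample: deviation about the median versus about the mean.
sample = [1, 2, 3, 4, 10]
about_median = mean_deviation(sample, median(sample))
about_mean = mean_deviation(sample, mean(sample))
print(about_median, about_mean)
```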


3.v. The standard deviation. By far the best and most 
useful measure of scatter is the standard deviation. In words, 
this is the square root of the mean of the squares of the 
deviations of the observations from their arithmetic mean. In 
symbols, if $(X - \bar{X})$ represents the deviation of an individual 
reading from the mean and $S(X - \bar{X})^2$ the sum of the squares 
of all such deviations, then the standard deviation, $\sigma$, is given 
by the formula 

$$\sigma = \sqrt{\frac{S(X - \bar{X})^2}{N}}. \qquad (5)$$

The square of the standard deviation is called the variance 
or second moment, the latter usually being denoted by $\mu_2$. 
Hence the variance 

$$\sigma^2 = \mu_2 = \frac{S(X - \bar{X})^2}{N}. \qquad (5A)$$


The method of calculating the standard deviation depends 
on the type of data and the number of observations, and 
the following sections show how to calculate it in different 
cases. 


3.vi. When N, the number of observations, is small and 
the mean of the observations is a whole number, the standard 
deviation is easily calculated by subtracting the mean from 
each observation, squaring each of these differences, adding 
the squares, dividing the sum of the squares by N and finally 
taking the square root of the quotient.* 


Example 7. Find the standard deviation of the first 11 
natural numbers. 

  X     X − X̄    (X − X̄)² 
  1      −5        25 
  2      −4        16 
  3      −3         9 
  4      −2         4 
  5      −1         1 
  6       0         0 
  7       1         1 
  8       2         4 
  9       3         9 
 10       4        16 
 11       5        25 
 ——      ——       ——— 
 66       0       110 

S(X) = 66, 
N = 11, 
X̄ = 66/11 = 6, 
S(X − X̄)² = 110, 
σ = √(110/11) = √10 = 3.16. 
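
The same steps (subtract the mean, square, sum, divide by N, take the root) can be written out directly. This is a minimal sketch; the cross-check with `statistics.pstdev`, which also divides by N, is an addition for verification and is not part of the text.

```python
import math
from statistics import pstdev

xs = list(range(1, 12))           # the first 11 natural numbers
n = len(xs)
x_bar = sum(xs) / n               # arithmetic mean
sum_sq_dev = sum((x - x_bar) ** 2 for x in xs)
sigma = math.sqrt(sum_sq_dev / n)

print(sum(xs), x_bar, sum_sq_dev, round(sigma, 2))
print(math.isclose(sigma, pstdev(xs)))
```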



* When the variance in the whole population is being estimated 
from a sample, the sum of squares is divided by N − 1 (see 5.ii, p. 37). 

3.vii. It rarely happens in practice, however, that the 
mean of a set of observations is a whole number or that N is 
as small as 11. If the mean is fractional, it would be laborious 
to square all the fractional differences between the readings 
and their mean. It is possible, however, to avoid this. If the 
different values of the variable are $X_1$, $X_2$, $X_3$, etc., then their 
differences from the mean are $(X_1 - \bar{X})$, $(X_2 - \bar{X})$, $(X_3 - \bar{X})$, etc. 
Squaring these we get 

$$(X_1 - \bar{X})^2 = X_1^2 - 2X_1\bar{X} + \bar{X}^2,$$
$$(X_2 - \bar{X})^2 = X_2^2 - 2X_2\bar{X} + \bar{X}^2,$$
$$(X_3 - \bar{X})^2 = X_3^2 - 2X_3\bar{X} + \bar{X}^2, \text{ etc.}$$

Summing these we have 

$$S(X - \bar{X})^2 = S(X^2) - 2\bar{X}\,S(X) + N\bar{X}^2.$$

Divide both sides by N. Then 

$$\frac{S(X - \bar{X})^2}{N} = \frac{S(X^2)}{N} - 2\bar{X}\,\frac{S(X)}{N} + \bar{X}^2.$$

But $S(X)/N = \bar{X}$, by formula (1); hence 

$$\frac{S(X - \bar{X})^2}{N} = \frac{S(X^2)}{N} - \bar{X}^2. \qquad (5B)$$
In words, the variance is the mean of the squares of the 
observations less the square of their mean. Hence the standard 
deviation of the numbers in Example 7 could have been 
calculated as under ; 


  X      X² 
  1       1 
  2       4 
  3       9 
  4      16 
  5      25 
  6      36 
  7      49 
  8      64 
  9      81 
 10     100 
 11     121 
 ——     ——— 
 66     506 

S(X) = 66, 
N = 11, 
X̄ = 6. 
Variance = 506/11 − 6² 
         = 46 − 36 
         = 10. 
Hence σ = √10 = 3.16. 
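
The identity "mean of the squares less the square of the mean" can be compared numerically with the direct definition. The sketch below does so for the numbers of Example 7 and for a small hypothetical sample whose mean is fractional; the function names are illustrative.

```python
import math


def variance_direct(xs):
    """Mean of the squared deviations from the arithmetic mean."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)


def variance_shortcut(xs):
    """Mean of the squares less the square of the mean."""
    n = len(xs)
    return sum(x * x for x in xs) / n - (sum(xs) / n) ** 2


first_eleven = list(range(1, 12))
print(sum(x * x for x in first_eleven), variance_shortcut(first_eleven))

# Hypothetical sample with a fractional mean (7/3).
fractional = [1, 2, 4]
print(math.isclose(variance_direct(fractional), variance_shortcut(fractional)))
```

With very large values and small scatter the shortcut subtracts two nearly equal numbers, so in floating-point work the direct form can be the more accurate of the two.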



The method is more usefully extended to a frequency 
distribution table, for which we have the following formula: 

$$\sigma^2 = \frac{S(fX^2)}{N} - \bar{X}^2. \qquad (6)$$

This may also be written 

$$\sigma^2 = \frac{S(fX^2)}{N} - \left(\frac{S(fX)}{N}\right)^2.$$

The use of this formula is illustrated in the following example. 

Example 8. Find the standard deviation of the data in the 
following frequency distribution table: 

  X    1    2    3    4    5    6 
  f    2    6   12    7    2    1 

Construct two further columns, as under: 

  X     f     fX    fX² 
  1     2      2      2 
  2     6     12     24 
  3    12     36    108 
  4     7     28    112 
  5     2     10     50 
  6     1      6     36 
       ——     ——    ——— 
       30     94    332 

N = S(f) = 30, 
S(fX) = 94, 
S(fX²) = 332, 

σ² = 332/30 − (94/30)² 
   = 11.0667 − 9.8178 
   = 1.2489. 

σ = √1.2489 = 1.117. 
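
The two extra columns of Example 8 correspond to two weighted sums. A minimal sketch follows; the dictionary `table` simply holds the X and f columns of the example.

```python
import math

# X -> f, from the frequency table of Example 8.
table = {1: 2, 2: 6, 3: 12, 4: 7, 5: 2, 6: 1}

n = sum(table.values())                           # S(f)
s_fx = sum(f * x for x, f in table.items())       # S(fX)
s_fx2 = sum(f * x * x for x, f in table.items())  # S(fX^2)

variance = s_fx2 / n - (s_fx / n) ** 2
sigma = math.sqrt(variance)
print(n, s_fx, s_fx2, round(variance, 4))
```

Carried to full precision the square root is about 1.1175; the text's 1.117 is that figure cut to three decimals.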


3.viii. This method becomes laborious if the figures in the 
X column are large, since $X^2$ will be very much larger and the 
$fX^2$ column will require a good deal of calculation and 
addition. It is usually easiest, especially if N is large and the range 
of the variable wide, to work from a grouped frequency 
distribution table. The formulae for the standard deviation in this 
case are: 

$$\sigma_x^2 = \frac{S(fx^2)}{N} - \left(\frac{S(fx)}{N}\right)^2. \qquad (7)$$

Using formula (3), this may also be written 

$$\sigma_x^2 = \frac{S(fx^2)}{N} - \bar{x}^2. \qquad (7A)$$

Also 

$$\sigma = \sigma_x \times (i). \qquad (8)$$

In these formulae $\sigma_x$ is the standard deviation in working 
units. The method of calculation is an extension of that used 
in Example 3 and the whole process is shown below. 



Example 9. Calculate the standard deviation of the data in 
Table II. 

  X           f      x      fx     fx² 
  190-193     2     −7     −14      98 
  194-197     4     −6     −24     144 
  198-201     7     −5     −35     175 
  202-205    12     −4     −48     192 
  206-209    19     −3     −57     171 
  210-213    24     −2     −48      96 
  214-217    27     −1     −27      27 
  218-221    35      0    −253       0 
  222-225    26      1      26      26 
  226-229    21      2      42      84 
  230-233    18      3      54     162 
  234-237    13      4      52     208 
  238-241     6      5      30     150 
  242-245     5      6      30     180 
  246-249     2      7      14      98 
  250-253     1      8       8      64 
            ———           ————    ———— 
            222            256    1875 
                          −253 
                          ———— 
                             3 

The $fx^2$ column is obtained by multiplying together the 
corresponding entries in the $x$ and the $fx$ columns. 

$$\sigma_x^2 = \frac{1875}{222} - \left(\frac{3}{222}\right)^2 = 8.4457.$$

$$\sigma_x = \sqrt{8.4457} = 2.906.$$

$$\therefore\ \sigma = 2.906 \times 4 = 11.624.$$
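
The working-unit method of Example 9 can be sketched as follows. The frequencies are those of Table II; taking the group 218-221 as the arbitrary origin and 4 as the group width follows the example, while the variable names are illustrative.

```python
import math

freqs = [2, 4, 7, 12, 19, 24, 27, 35, 26, 21, 18, 13, 6, 5, 2, 1]
width = 4            # group unit of Table II
origin_index = 7     # the group 218-221 is given working value x = 0

steps = [k - origin_index for k in range(len(freqs))]   # x column: -7 .. 8
n = sum(freqs)
s_fx = sum(f * x for f, x in zip(freqs, steps))
s_fx2 = sum(f * x * x for f, x in zip(freqs, steps))

var_units = s_fx2 / n - (s_fx / n) ** 2   # variance in working units
sigma = math.sqrt(var_units) * width      # back to the original units
print(n, s_fx, s_fx2, round(var_units, 4), round(sigma, 2))
```

Because the root is not rounded before multiplying by the group width, the last decimal place may differ slightly from the hand calculation.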


3.ix. The coefficient of variation. If we wish to compare 
the scatter of different variables about their means, it is useful 
to be able to express the scatter in some form which is not 
dependent on the absolute size of the variables. For instance, 
mice and men may be relatively equally variable in length, but 
this would not be revealed by stating the standard deviation 
in inches of a sample of each. Comparison may, however, be 
usefully made by calculating the coefficient of variation in each 
case. This is readily obtained by expressing the standard 
deviation as a percentage ratio of the mean, i.e. 

$$V = \frac{100\sigma}{m}, \qquad (9)$$

where m is the mean of the sample. 

The coefficient V is a ratio and is therefore independent 
of the units in which the mean and standard deviation are 
measured. 


Example 10. Compare the relative variability of the data in 
Examples 7 and 9. 

In Example 7, X̄ = 6, 
σ = 3.16, 
V = (100 × 3.16)/6 = 52.67. 

In Example 9, X̄ = 219.55, 
σ = 11.624, 
V = (100 × 11.624)/219.55 = 5.29. 

Hence the data in Example 7 are about 10 times as scattered 
relatively to their mean as are those in Example 9. 
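
The comparison can be reproduced in a few lines. This sketch uses the means and standard deviations quoted for Examples 7 and 9; the function name is an illustrative choice.

```python
def coefficient_of_variation(sigma: float, mean: float) -> float:
    """Standard deviation expressed as a percentage of the mean."""
    return 100 * sigma / mean


v_example_7 = coefficient_of_variation(3.16, 6)
v_example_9 = coefficient_of_variation(11.624, 219.55)
print(round(v_example_7, 2), round(v_example_9, 2),
      round(v_example_7 / v_example_9, 1))
```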


EXERCISES ON CHAPTER III 

6. Find the upper and lower quartiles of the scores in each test 
from A to G inclusive for the whole 100 subjects and thence derive the 
semi-inter-quartile range of each test. 

7. Calculate the mean deviation of the scores in each of the tests 
from A to G, using the medians given in the answers to Exercise 4, for 

(a) subjects 1-50, 

(b) subjects 51-100. 

8. Using formula (6), calculate the standard deviation of the 
scores in 

(a) test E, subjects 1-25, 

(b) test D, subjects 76-100. 

9. Calculate the standard deviation of the following test scores, 
using formula (6); tests A, D, F and G: 

(a) subjects 1-25, 

(b) subjects 26-50, 

(c) subjects 51-75, 

(d) subjects 76-100. 

10. Using the grouped frequency distribution tables constructed in 
Exercise 3, calculate the standard deviation of the scores of each test 
from A to G for the whole 100 subjects. Use formulae (7A) and (8). 

From these standard deviations and the means given in the answers 
to Exercise 3(a), calculate the coefficient of variation of each test 
score. Use formula (9). 



Chapter IV 

THE NORMAL AND BINOMIAL 
DISTRIBUTIONS 

4.i. Many of the methods of statistics depend on the 
assumption that the variables under consideration are normally 
distributed and most of the methods cannot strictly be 
applied unless this assumption is justified. Biological variables 
are often in point of fact so distributed but it is not safe to 
assume that any particular distribution is normal without 
examination. 



Fig. 1. Histogram of frequency distribution. 


The meaning of a normal distribution may be most easily 
understood by considering certain graphs. Suppose we construct 
a graph of the data in Table II. Along the base line we 
first mark off points at equal intervals to represent the 
X-groups. On the left we make a vertical scale to indicate 
frequencies. At each point marking an X-group we then draw 
vertical lines to represent frequencies: the requisite length of 
these lines is indicated by the frequency of a particular 
X-group and the frequency scale on the left. Finally we join these 
vertical lines together by horizontal lines and we have what is 
called a histogram of the frequency distribution of Table II. 
This is illustrated in Fig. 1. 

This figure gives a picture of the distribution of the variable 
X and it shows that most of the observations are clustered 
about the middle of the range and that there are relatively few 
observations at the extremes. The total frequency of the 
observations is given by the area between the base-line, the top 
of the histogram and the vertical lines at the boundaries if the 
interval between the groups is 1 unit. It will be seen that the 



Fig. 2. The normal distribution. 


top of the histogram is rather irregular. If, however, a very 
large number of observations had been made and the points 
along the base-line had been much more numerous, then the 
boundary of the histogram would have become a smooth, con¬ 
tinuous line, in shape something like the curve shown in Fig. 2. 

Fig. 2 shows the shape of a normal distribution, sometimes 
called a Gaussian distribution. This curve has various im¬ 
portant properties, some of which must be mentioned. 

4.ii. The normal distribution has an exact mathematical 
formula. It is a continuous curve and applies to continuous 
variables, such as height, where the difference between one 
value of the variable and the next can be indefinitely small. 
Mathematically the curve stretches to infinity in both 
directions, but practically only the portion drawn above is of 
importance. 

The mean of the observations is in the exact centre of the 
curve and there is the greatest number of observations at this 
point. Since the curve is symmetrical, the median and the 
mode coincide with the mean. 

The area between the curve and two uprights drawn at any 
points gives the fraction of the total number of observations 
between those points. In Fig. 3 uprights have been drawn at 
points corresponding to 1,2 and 3 times the standard deviation 
of the distribution on each side of the mean. 



The area between −σ and σ is 68 % of the total area. This 
means that in a normal distribution 68 % of the observations 
lie within a distance equal to the standard deviation on each 
side of the mean. Similarly, from −2σ to 2σ includes 95 %, 
and from −3σ to 3σ includes 99.7 % of the observations. 
Hence it is obvious that in a normal distribution practically 
all the observations lie within a range of 6 times the standard 
deviation. This provides a rough check on the size of a 
calculated standard deviation: if the number of observations is 
large, the standard deviation should be approximately a sixth 
of the range. (Note that in Example 9 the standard deviation 
was 11.6 and the range was 63, nearly 6 times as great.) 





Exactly half the observations are included in an area 
bounded by a distance of 0.67449σ on each side of the mean, 
shown by dotted lines in Fig. 3. This means that the chances 
are exactly equal that a single observation shall deviate from 
the mean by an amount greater or less than 0.67449σ. Use is 
made of this fact in calculating probable errors, which will be 
explained later. 

Expressing the above measurements in a different way we 
may say that a value of X whose deviation from the mean, 
either positively or negatively, is greater than σ will occur 
roughly 1 in 3 times; a positive or negative deviation greater 
than 2σ will occur about 1 in 20 times, and greater than 3σ 
about 1 in 370 times. Tables have been calculated showing the 
probability of obtaining deviations of any size. Such a table 
is called a Table of Probability Integral and may be found in 
Tables for Statisticians and Biometricians (Ref. 3). 
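
In place of a printed table, the areas quoted above can be obtained from the error function, since the area of a normal curve within k standard deviations of the mean equals erf(k/√2). The sketch below uses only Python's standard `math` module; the function name `central_area` is illustrative.

```python
import math


def central_area(k: float) -> float:
    """Fraction of a normal distribution lying within k sigma of the mean."""
    return math.erf(k / math.sqrt(2))


for k in (0.67449, 1, 2, 3):
    inside = central_area(k)
    outside = 1 - inside
    print(f"{k:>7}: inside {inside:.4f}, outside about 1 in {1 / outside:.1f}")
```

The figures in the text are rounded for easy recall; the computed area within 2σ, for instance, is slightly above 95 %, so the corresponding odds are a little longer than 1 in 20.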

On such facts is based the conception of statistical signi¬ 
ficance. The term ‘ significance ’ is used in statistics to indicate 
that the odds are heavy against the deviation from its expected 
value of a particular estimate, difference or coefficient occur¬ 
ring by chance as a result of random sampling. In practice 
odds of 19 to 1 against an occurrence by chance are usually 
taken as indicating the significance of that occurrence. This 
corresponds roughly to the odds of getting a deviation from 
the mean of a normal distribution greater than twice the 
standard deviation, either positively or negatively. (Some 
statisticians prefer heavier odds, such as 99 to 1, as their 
criterion of significance. This must to some extent depend on 
the nature of the variables, but in general a probability of 19 
to 1 against an occurrence is usually regarded as sufficient for 
significance.) 

Probability, which is symbolised as P, is usually expressed 
as a decimal fraction, so that odds of 19 to 1 and 99 to 1 are 
written as P = 0.05 and P = 0.01 respectively. Writers 
sometimes refer to these as the 5 % and 1 % levels of significance. 

4.iii. Testing the normality of a distribution. It is 
unsafe to regard any bell-shaped distribution as being 
necessarily a normal distribution, and since so much of statistical 
method depends on normality it is important to be able to 
test any given distribution for normality. This may be done 
quite simply by mathematical means, although the process 
requires a good deal of arithmetic. 

Essentially, testing the normality of a distribution depends 
on the calculation of two constants, $\beta_1$ and $\beta_2$, which are 
derived from the first four moments about the mean of the 
distribution, and two further quantities, $\gamma_1$ and $\gamma_2$, which are 
related to $\beta_1$ and $\beta_2$ according to the following equations: 

$$\gamma_1 = \pm\sqrt{\beta_1},$$

$$\gamma_2 = \beta_2 - 3.$$

$\gamma_1$ is a measure of whether or not the distribution is 
symmetrical. $\gamma_2$ measures departures of a symmetrical nature from 
normality. The use of these constants will be explained later. 

Now the first four moments about the mean of a frequency 
distribution are denoted by $\mu_1$, $\mu_2$, $\mu_3$, $\mu_4$, and these are 
calculated, with certain theoretical corrections, by a method 
which is simply an extension of that already used in calculating 
the standard deviation, as in Example 9. The student should 
now be familiar with the method of calculating $S(fx)$ and 
$S(fx^2)$. Two further columns have to be constructed, the totals 
of which will yield $S(fx^3)$ and $S(fx^4)$. If these four totals are 
divided by N we get four quantities denoted by $\nu'_1$, $\nu'_2$, $\nu'_3$ and 
$\nu'_4$. The moments about the mean are then obtained from the 
equations: 

$$\mu_1 = 0,$$
$$\mu_2 = \nu'_2 - \nu'^2_1,$$
$$\mu_3 = \nu'_3 - 3\nu'_1\nu'_2 + 2\nu'^3_1,$$
$$\mu_4 = \nu'_4 - 4\nu'_1\nu'_3 + 6\nu'^2_1\nu'_2 - 3\nu'^4_1.$$

When the variate is continuous certain corrections have to be 
applied for grouping, and the equations then become: 

$$\mu_2 = \nu'_2 - \nu'^2_1 - \tfrac{1}{12},$$
$$\mu_3 = \nu'_3 - 3\nu'_1\nu'_2 + 2\nu'^3_1,$$
$$\mu_4 = \nu'_4 - 4\nu'_1\nu'_3 + 6\nu'^2_1\nu'_2 - 3\nu'^4_1 - \tfrac{1}{2}\mu_2 - 0.0125.$$



Having obtained these four moments, $\beta_1$ and $\beta_2$ are given by 
the formulae 

$$\beta_1 = \frac{\mu_3^2}{\mu_2^3}, \qquad \beta_2 = \frac{\mu_4}{\mu_2^2}.$$

From these $\gamma_1$ and $\gamma_2$ are readily calculated. The standard 
errors* of $\gamma_1$ and $\gamma_2$ are $\sqrt{6/N}$ and $\sqrt{24/N}$ respectively. This 
means that if the values of $\gamma_1$ and $\gamma_2$ are less than twice these 
standard errors, then the distribution is not significantly 
different from the normal form: if they are greater than twice 
their standard errors, the distribution is not normal. 

This mass of symbolism will probably be alarming to the 
elementary student, but the actual process of calculation 
involves arithmetic only and is shown in the following example. 


Example 11. Test the normality of the distribution given in 
the first two columns of the table below: 

   f      x      fx     fx²     fx³     fx⁴ 
   1     −5      −5      25    −125     625 
   2     −4      −8      32    −128     512 
   5     −3     −15      45    −135     405 
  10     −2     −20      40     −80     160 
  20     −1     −20      20     −20      20 
  50      0     −68       0    −488       0 
  22      1      22      22      22      22 
  11      2      22      44      88     176 
   5      3      15      45     135     405 
   3      4      12      48     192     768 
   1      5       5      25     125     625 
 ———            ———     ———    ————    ———— 
 130             76     346     562    3718 
                −68            −488 
                ———            ———— 
                  8              74 

ν′₁ = 8/130 = 0.0615, 
ν′₂ = 346/130 = 2.6615, 
ν′₃ = 74/130 = 0.5692, 
ν′₄ = 3718/130 = 28.6000. 

* The meaning of the term ‘standard error’ is explained in 
Chapter v. 



Hence 

μ₂ = 2.6615 − 0.0615² − 0.0833 = 2.5744, 
μ₃ = 0.5692 − (3 × 2.6615 × 0.0615) + (2 × 0.0615³) = 0.0787, 
μ₄ = 28.6 − (4 × 0.0615 × 0.5692) + (6 × 0.0615² × 2.6615) 
     − (3 × 0.0615⁴) − ½(2.5744) − 0.0125 = 27.2207. 

∴ β₁ = 0.0787²/2.5744³ = 0.000363, 

  β₂ = 27.2207/2.5744² = 4.1072; 

γ₁ = 0.0191 (γ₁ has the same sign as μ₃), 

S.e. = √(6/130) = 0.215; 

γ₂ = 1.1072, 

S.e. = √(24/130) = 0.430. 

It will be seen that γ₁ is considerably less than twice its 
standard error, hence the distribution is symmetrical. γ₂, 
however, is more than twice its standard error, so that the 
distribution departs from normality in a symmetrical manner. γ₂ is 
said to measure kurtosis. Curves which are flat-topped and 
short-tailed compared with the normal curve are called 
platykurtic: for these β₂ is less than 3. Curves which are sharply 
peaked and long-tailed, and for which β₂ is greater than 3, are 
called leptokurtic. In the above example, β₂ is greater than 3, 
so that the distribution is leptokurtic and not normal. The 
student should draw a histogram of this distribution and note 
the peaked shape of it. 
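
The whole of Example 11 can be traced in a short sketch. It assumes the grouping corrections as they are applied in the worked example (1/12 subtracted from μ₂, and ½μ₂ + 0.0125 subtracted from μ₄), and the variable names are illustrative. Because the code keeps full precision instead of rounding the ν′ values to four places, its later decimals differ slightly from the hand calculation.

```python
import math

f = [1, 2, 5, 10, 20, 50, 22, 11, 5, 3, 1]
x = list(range(-5, 6))
n = sum(f)

# Raw moments about the working origin: S(f x^k) / N for k = 1..4.
v1, v2, v3, v4 = (sum(fi * xi ** k for fi, xi in zip(f, x)) / n
                  for k in (1, 2, 3, 4))

# Moments about the mean, with the grouping corrections used in the example.
mu2 = v2 - v1 ** 2 - 1 / 12
mu3 = v3 - 3 * v1 * v2 + 2 * v1 ** 3
mu4 = v4 - 4 * v1 * v3 + 6 * v1 ** 2 * v2 - 3 * v1 ** 4 - mu2 / 2 - 0.0125

beta1 = mu3 ** 2 / mu2 ** 3
beta2 = mu4 / mu2 ** 2
gamma1 = math.copysign(math.sqrt(beta1), mu3)   # takes the sign of mu3
gamma2 = beta2 - 3

se_gamma1 = math.sqrt(6 / n)
se_gamma2 = math.sqrt(24 / n)

print("symmetrical:", abs(gamma1) < 2 * se_gamma1)
print("normal kurtosis:", abs(gamma2) < 2 * se_gamma2)
```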

4.iv. The above is not the only method of testing the 
normality of a distribution. An alternative method is to fit a 
normal curve to a frequency distribution and test the goodness 
of fit by the χ² method. (See Section 9.vii.) 

4.v. The binomial distribution and the calculation of 
probabilities. In some cases the significance of observed 
results may be tested by use of the binomial distribution. 
This distribution is well known in algebra and consists of the 
expansion of the expression $N(p+q)^n$. The full expansion of 
this general expression, written symbolically, is 

$$N(p+q)^n = N\left\{p^n + np^{n-1}q + \frac{n(n-1)}{2!}p^{n-2}q^2 + \frac{n(n-1)(n-2)}{3!}p^{n-3}q^3 + \dots + q^n\right\}. \qquad (10)$$

Usually in probability problems, p is the probability of an 
event happening, expressed as a fraction, and q the 
probability of its not happening, so that p + q = 1. N is the number 
of trials made and n is the number of events in each trial. 
For example, suppose we toss 4 pennies 32 times. Tossing 
1 penny is an event; tossing 4 at once is a trial of 4 events and 
in our experiment we make 32 trials. We wish to know the 
probabilities of getting 4 heads, 3 heads and 1 tail, etc. Since 
the chance of getting either a head or a tail in one event is ½, 
the probabilities of getting the different results will be given 
by the successive terms in the expansion of $32(\tfrac12+\tfrac12)^4$. 
Substituting these figures in the general expansion (10) we obtain 

$$32(\tfrac12+\tfrac12)^4 = 32\left\{(\tfrac12)^4 + 4(\tfrac12)^3(\tfrac12) + \frac{4\times 3}{2!}(\tfrac12)^2(\tfrac12)^2 + \frac{4\times 3\times 2}{3!}(\tfrac12)(\tfrac12)^3 + (\tfrac12)^4\right\}.$$

Evaluating the successive terms in this we have, calling 
heads H and tails T, 

chances of getting 4H and 0T are (½)⁴ × 32            =  2 
chances of getting 3H and 1T are 4(½)³(½) × 32        =  8 
chances of getting 2H and 2T are 6(½)²(½)² × 32       = 12 
chances of getting 1H and 3T are 4(½)(½)³ × 32        =  8 
chances of getting 0H and 4T are (½)⁴ × 32            =  2 
                                                        —— 
                                              Total     32 
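
The successive terms of expansion (10) can be generated with binomial coefficients. The sketch below is one way of doing so; it relies on `math.comb` (Python 3.8 or later), and the function name and argument order are illustrative.

```python
from math import comb


def expected_frequencies(trials, n, p):
    """Terms of trials * (p + q)^n, from 'n successes' down to 'no successes'."""
    q = 1 - p
    return [trials * comb(n, k) * p ** (n - k) * q ** k for k in range(n + 1)]


# 4 pennies tossed 32 times: expected trials showing 4, 3, 2, 1, 0 heads.
terms = expected_frequencies(trials=32, n=4, p=0.5)
print(terms, sum(terms))

# Trials expected to show at least 2 heads: the first three terms.
at_least_two_heads = sum(terms[:3])
print(at_least_two_heads)
```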


If we wished to know the odds of getting at least 2H showing 
in each trial, we could find them by summing all the terms 
involving two or more heads, i.e. the first three terms. Hence out 
of 32 trials we should expect to see two or more heads in 
2 + 8 + 12 = 22 trials. The process is similar in dice-throwing. 
Suppose, for instance, we throw 3 dice 60 times; how often 
should we expect to get 3 sixes? The probability p of getting 
a six in a single event is 1/6, so that the chances of getting 3 
sixes in 60 trials of 3 events is given by the first term in the 
expansion of 60(1/6 + 5/6)³. This term is 60(1/6)³ = 60/216. We should 
not therefore expect to get 3 sixes in as few as 60 trials, in fact 
we should need to make at least 216 throws before we could 
expect this to occur once. This does not mean, of course, that 
3 sixes would be thrown at the 216th throw (they might be 
thrown at the first attempt) but it means that the chances of 
throwing 3 sixes are 1 in 216, so that in a very large number of 
trials only 1 in 216 would yield this result. 

A more complicated problem would arise if we asked how 
often we could expect a total score of 15 or more if we threw 
3 dice 60 times. First we should need to know how many 
different scores are possible. The answer to this is obviously 
16 since the total score may be any number from 3 to 18 
inclusive. However, all these sixteen scores are not equally 
likely. For instance, a score of 18 can occur in one way only, 
i.e. 3 sixes, but a score of 17 can occur in three ways, i.e. 6, 6, 5 
or 6, 5, 6 or 5, 6, 6. A score of 16 may occur in 6 ways, viz. 
6, 6, 4: 6, 4, 6: 4, 6, 6: 6, 5, 5: 5, 6, 5 or 5, 5, 6. Working out all 
the possibilities in this way we arrive at the following table: 



          No. of                     No. of 
Score     possible ways    Score     possible ways 
 18            1            10           27 
 17            3             9           25 
 16            6             8           21 
 15           10             7           15 
 14           15             6           10 
 13           21             5            6 
 12           25             4            3 
 11           27             3            1 
                                        ——— 
                                 Total  216 


By addition we see that there are 216 ways of getting 
different scores. As a check on this total, note that there are 6 
ways in which the first die may fall, and for each of these there 
are 6 ways in which the second die may fall, and for each 
combination of the first and second there are 6 ways in which the 
third may fall. The total number of possible combinations of 
the three, therefore, is 6 × 6 × 6 = 216. Of these 216 possible 
score combinations, 1 + 3 + 6 + 10 = 20 will give a score of 15 
or more, hence the number of times we should expect a score 
of 15 or more in 60 throws of 3 dice is 

60 × 20/216 = 5.6. 

Although this illustration does not directly make use of the 
binomial expansion, it gives an example of the calculation of 
probabilities, a process which is not always obvious. 
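
Counting the ways by hand is error-prone, and the table can be checked by enumerating all 6 × 6 × 6 outcomes. This is a minimal sketch using the standard library; the variable names are illustrative.

```python
from collections import Counter
from itertools import product

# Number of ways each total can arise when three dice are thrown.
ways = Counter(sum(roll) for roll in product(range(1, 7), repeat=3))

total_outcomes = sum(ways.values())
favourable = sum(count for score, count in ways.items() if score >= 15)
expected_in_60 = 60 * favourable / total_outcomes

for score in range(18, 2, -1):
    print(score, ways[score])
print(total_outcomes, favourable, round(expected_in_60, 1))
```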

As a further example, using the binomial expansion, take 
the case of a football coupon where the results of 14 matches 
have to be forecast. The result may be a win, a draw or a loss 
in any one match. Assuming that the chances of a win, a draw 
or a loss are equal (this is probably not true in practice but it 
is assumed here for the sake of illustration), the probability of 
getting any single forecast correct by chance is 1 in 3. What is 
the probability of getting all 14 forecasts correct and all 14 
wrong? The answer to this is given by the first and last terms 
of the expansion of the binomial (⅓ + ⅔)¹⁴. The first term is 
(⅓)¹⁴ = 1/4,782,969. The last term is 

(⅔)¹⁴ = 16,384/4,782,969 = 1/291.93. 

Hence to make sure of one correct forecast in all 14 cases one 
would have to fill in over 4¾ million coupons, each with a 
different forecast, whereas on the average 1 form in every 292 
would be completely wrong. 
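
The first and last terms of this expansion can be evaluated exactly with rational arithmetic. The sketch below keeps the text's assumption that each forecast has a 1 in 3 chance of being right; the variable names are illustrative.

```python
from fractions import Fraction

matches = 14
p_right = Fraction(1, 3)          # assumed chance of one correct forecast
p_wrong = 1 - p_right

p_all_right = p_right ** matches  # first term of the expansion
p_all_wrong = p_wrong ** matches  # last term of the expansion

print(p_all_right, p_all_wrong)
print(round(float(1 / p_all_wrong), 2))   # one coupon in this many, all wrong
```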

4.vi. Now the mean of a binomial distribution $(p+q)^n$ is 
$np$ and the standard deviation is $\sqrt{npq}$. Suppose we toss a 
penny 100 times and count a head as a success. If the penny is 
unbiased, the mean number of successes would be 100 × ½ = 50. 
If, however, in a particular trial we obtained 62 heads, could 
we regard the penny as being biased? The standard deviation 
is √(100 × ½ × ½) = 5. The observed value of 62 successes differs 
from the expected mean by 12, and 12 is 2.4 times the standard 
deviation. Reference to the table of the probability integral 
shows that the probability of getting a deviation from the 
mean of 2.4σ or greater is just less than 0.02, so that we should 
strongly suspect the penny of being biased. Note that in this 
example we should have obtained the same result if we had 
had 38 observed successes, since the probability of a deviation 
of 2.4σ or more on either side of the mean is just less than 0.02. 
If, however, the penny were tossed in an electrical field which 
we had reason to suspect would cause more heads to appear, 
we should have to ask what is the probability of obtaining the 
same deviation from the mean in a positive direction only. 
This would be half that given in the table, i.e. with 62 heads it 
would be just less than 0.01 (0.0082 to be exact). Hence we 
should conclude that the electrical field was definitely having 
the suspected effect. 
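
The probabilities read from the table can be computed from the complementary error function, and for a case as small as 100 tosses the exact binomial tail can be summed as well. The exact sum is an addition for comparison and is not part of the text's method, which uses the normal curve without a continuity correction.

```python
import math
from math import comb

n, heads = 100, 62
mean = n * 0.5
sd = math.sqrt(n * 0.5 * 0.5)
z = (heads - mean) / sd                      # deviation in units of sigma

p_two_sided = math.erfc(z / math.sqrt(2))    # deviation in either direction
p_one_sided = p_two_sided / 2                # positive direction only

# Exact chance of 62 or more heads from an unbiased penny.
exact_one_sided = sum(comb(n, k) for k in range(heads, n + 1)) / 2 ** n

print(z, round(p_two_sided, 4), round(p_one_sided, 4))
print(round(exact_one_sided, 4))
```

The exact tail comes out somewhat larger than the normal-curve figure, which is the usual effect of treating a discrete count as continuous.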

The question of when to use P and when to use ½P for testing 
significance is apt to be a difficult one. The answer depends 
entirely upon the hypothesis we are testing. For example, 
a test is given to 100 subjects twice and it is found that 59 
subjects get a better score the second time and 41 get a worse. 
With these data we may test two hypotheses. 

(a) Assuming that it is equally likely for a subject to get a 
better or a worse score the second time, the chance of a subject 
getting a better score is ½. We may then test the hypothesis 
that the observed distribution of better and worse scores does 
not differ significantly from chance. As in the previous 
example, we should expect 50 subjects to improve by chance with 
a standard deviation of 5. The observed deviation from the 
mean is 9 which is 1.8 times σ. From the tables it may be seen 
that for this deviation P = 0.071. Hence we conclude that the 
observed distribution does not differ significantly from chance. 

(b) We may, however, decide on psychological grounds that 
doing a test twice is likely to have a disturbing influence on 
the scores, and we proceed to test the hypothesis that 
twice-testing will cause an improvement in scores. Here we are 
dealing with deviations from the mean in a positive direction 
only, so that the probability is ½P = 0.036. From this we 
should conclude that scoring on the test has improved 
significantly on a second testing. Note that this shows only that 
a greater number of individuals than would be expected by 
chance improve on a second testing. The amount of 
improvement, as measured by the number of marks scored, might still 
not be significant. To decide this further point it would be 
necessary to average the 100 differences between the first and 
second scores and apply the t test to see if this mean differed 
significantly from zero. 
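
How the choice between P and ½P changes the verdict at the 5 % level can be sketched as below. The function `sign_count_p` is a hypothetical helper, not from the text; it assumes a chance of ½ per subject and, for the one-sided case, that the observed deviation lies in the direction predicted beforehand.

```python
import math


def sign_count_p(successes, n, one_sided=False):
    """Normal-curve probability for `successes` out of n when p = 1/2."""
    mean = n / 2
    sd = math.sqrt(n) / 2
    z = abs(successes - mean) / sd
    p_two = math.erfc(z / math.sqrt(2))
    # Halving is valid only if the direction was fixed before seeing the data.
    return p_two / 2 if one_sided else p_two


p_a = sign_count_p(59, 100)                   # hypothesis (a): uses P
p_b = sign_count_p(59, 100, one_sided=True)   # hypothesis (b): uses P/2

for label, p in (("P", p_a), ("P/2", p_b)):
    verdict = "significant" if p < 0.05 else "not significant"
    print(label, round(p, 3), verdict)
```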

As a note of warning, it must be emphasised that the grounds 
for using ½P as a measure of significance must be exceedingly 
firm and justified before making the actual test. If there is the 
slightest doubt, it is far safer in the interests of truth to use 
P as a criterion. This is a matter of statistical ethics. The 
honest inquirer makes up his mind before analysing his data 
what can reasonably be expected from them, and for convic¬ 
tion about the truth of his statistical results he must feel that 
they are not based on any specious argument, however 
plausible. 

EXERCISES ON CHAPTER IV 

11. Using the method of Section 4.iii (Example 11), test the 
normality of the distribution of the scores in test A for the whole 
100 subjects. Make use of the grouped frequency distribution table 
constructed in Exercise 3. 

12. Repeat Exercise 11 for the whole 100 scores in test C.

13. Five pennies are tossed 320 times. How often would you expect 
at least 4 heads ? 

14. Sixty-four people are asked a problem question, the answer to 
which can be only ‘Yes’ or ‘No’; 38 people answer ‘Yes’, which is 
correct, and 26 answer ‘No’. What are the chances of obtaining this
result if all the answers were guesses?





Chapter V

SIGNIFICANCE OF MEAN AND DIFFERENCE 
BETWEEN MEANS 


5.i. The observations recorded in a single biological experiment
are but one sample drawn from the whole population of
possible samples. If a second experiment is made it is unlikely
that the mean of the observations in this case will be identical
with that of the first experiment. In short, it will be found that
a large number of experiments will yield many different values
of the mean, each one departing more or less from the true
mean of the whole population.

If the standard deviation of the whole population is $\sigma_p$ and
we take a large number of random samples of n observations,
then the means of the samples will be distributed with a
standard deviation $\sigma_p/\sqrt{n}$. If the population is normally
distributed, the means also will be normally distributed. Even if
the distribution of the population is not normal, the distribution
of the means of samples still tends to be normal if the size
of the samples is sufficiently large, but in the case of small
samples the distribution of the means is not normal.

Usually we do not know the standard deviation of the whole 
population but have to take the standard deviation of an 
observed sample as an estimate of it. In this case we estimate 
the standard deviation of the sampling distribution from the 
number and standard deviation of a single sample. This 
estimated value is called the standard error of the mean, i.e. 


$$\text{S.e. of mean} = \frac{\sigma}{\sqrt{N}}, \qquad (11)$$

where $\sigma$ is the standard deviation of the sample and N the
number of observations in it.

(There was formerly a practice, which has little to recommend
it, of calculating the probable error of the mean. This is
given by the formula

$$\text{P.e. of mean} = \frac{0.67449\,\sigma}{\sqrt{N}}.$$

It may be noted that three times the probable error is roughly
equal to twice the standard error.)
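As a small sketch of formula (11) and of the remark about the probable error, the following computes both quantities. The figures ($\sigma$ = 12, N = 36) are invented for illustration and the function names are hypothetical.

```python
import math

PROBABLE_ERROR_FACTOR = 0.67449  # the factor quoted in the text

def standard_error_of_mean(sigma, n):
    """Formula (11): sigma divided by the square root of N."""
    return sigma / math.sqrt(n)

def probable_error_of_mean(sigma, n):
    """Probable error: 0.67449 times the standard error."""
    return PROBABLE_ERROR_FACTOR * standard_error_of_mean(sigma, n)

# Illustrative figures only (not taken from the text).
se = standard_error_of_mean(12.0, 36)
pe = probable_error_of_mean(12.0, 36)
print(f"s.e. = {se:.3f}, p.e. = {pe:.3f}")
print(f"3 x p.e. = {3 * pe:.3f}, 2 x s.e. = {2 * se:.3f}")
```

Since 3 × 0.67449 is about 2.02, the two criteria differ by roughly one per cent, whatever the sample.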

5.ii. Significance of a single mean. If we have only a
single sample to give estimates of $\bar{X}$ and $\sigma$, then the
distribution of $\bar{X}/\sigma$ will not be normal. However, the correct
distribution has been calculated and tables have been made enabling
us to make use of the data from a single sample; an extract
from these tables will be given later.

When calculating the standard deviation for the purpose of
examining the significance of the mean, $S(X-\bar{X})^2$ should be
divided by (N - 1) instead of by N. The reason for this is that
we are making an estimate of the standard deviation in the
whole population and this is best obtained by dividing the
sum of the squared deviations from the mean by the number
of degrees of freedom available. This number is one less than
the number in the sample.

Table III. Values of t corresponding to
a probability P = 0.05

   n       t         n       t         n       t
   1    12.706      11     2.201      21     2.080
   2     4.303      12     2.179      22     2.074
   3     3.182      13     2.160      23     2.069
   4     2.776      14     2.145      24     2.064
   5     2.571      15     2.131      25     2.060
   6     2.447      16     2.120      26     2.056
   7     2.365      17     2.110      27     2.052
   8     2.306      18     2.101      28     2.048
   9     2.262      19     2.093      29     2.045
  10     2.228      20     2.086      30     2.042

                 For n = ∞, t = 1.96.

The above table is an extract from a full table given by R. A. Fisher
(Ref. 2), Table IV; or in the Fisher and Yates Tables (Ref. 6), Table III.

Having obtained the mean and standard deviation we need
to calculate a statistic known as t; this is essentially the ratio
of the mean to its standard error. (t may also be the ratio of
a difference between means to its standard error: see below,
Section 5.iii (b).)

For a single mean

$$t = \frac{\bar{X}\sqrt{N}}{\sigma}. \qquad (12)$$

In Table III are given values of t corresponding to different
values of n, the number of degrees of freedom, i.e. n = N - 1 in
this case. The odds against values of t as big as or bigger than
these occurring by chance are 19:1, i.e. the probability, usually
denoted by P, of their occurring by chance is 0.05. If the
calculated value of t is greater than that given in the table for
the appropriate value of n, then the mean is significantly
different from zero.

Example 12. Ten schoolchildren were given an arithmetic 
test. They were then given a month’s further tuition and a 
second test of equal difficulty was held at the end of it. Their 
marks in these two tests are given below. 


  Scholar    Test 1    Test 2
     1         20        22
     2         18        19
     3         19        17
     4         22        18
     5         17        21
     6         20        23
     7         19        19
     8         16        20
     9         21        22
    10         19        20


Do these marks give evidence that the scholars had benefited 
by the extra tuition? 

This problem resolves itself into the question: Is the mean
of the differences between successive marks significantly
different from zero?

We need first to construct a third column giving the values
of (Test 2 - Test 1); this will be our X column. Add this column
to get $S(X)$ and obtain $\bar{X}$ by dividing $S(X)$ by N (formula (1)).
Make a fourth column giving $(X-\bar{X})$ and a fifth giving
$(X-\bar{X})^2$. Add this fifth column to obtain $S(X-\bar{X})^2$. This
gives us all the necessary data for calculating t. The last
three columns and the remainder of the working are shown
below.


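     Test 2 - Test 1

     X      $X-\bar{X}$     $(X-\bar{X})^2$
     2         1              1
     1         0              0
    -2        -3              9
    -4        -5             25
     4         3              9
     3         2              4
     0        -1              1
     4         3              9
     1         0              0
     1         0              0
    --                       --
    10                       58

$S(X) = 10$, $\bar{X} = 1$,

$$\sigma^2 = \frac{S(X-\bar{X})^2}{N-1} = \frac{58}{9}, \qquad \sigma = \sqrt{\frac{58}{9}}.$$

Hence, by formula (12),

$$t = \frac{1 \times \sqrt{10}}{\sqrt{58/9}} = \sqrt{\frac{90}{58}} = 1.25.$$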


Reference to Table III shows that for n = 9, t = 2.262. Our
calculated value is less than this, hence the mean of X is not 
significantly different from zero, and the marks are insufficient 
to prove the benefit of the extra tuition. 
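The working of Example 12 can be retraced with the sketch below. It follows formula (12) with the (N - 1) divisor; the function and variable names are illustrative, and the critical value is simply copied from Table III.

```python
import math

test_1 = [20, 18, 19, 22, 17, 20, 19, 16, 21, 19]
test_2 = [22, 19, 17, 18, 21, 23, 19, 20, 22, 20]

def single_mean_t(values):
    """t for a single mean, using the (N - 1) divisor for sigma."""
    n = len(values)
    mean = sum(values) / n
    sum_sq_dev = sum((x - mean) ** 2 for x in values)
    sigma = math.sqrt(sum_sq_dev / (n - 1))
    t = mean * math.sqrt(n) / sigma
    return mean, sum_sq_dev, t, n - 1

# X column: Test 2 minus Test 1 for each scholar.
differences = [b - a for a, b in zip(test_1, test_2)]
mean, sum_sq_dev, t, dof = single_mean_t(differences)

CRITICAL_T_9_DF = 2.262  # Table III, n = 9, P = 0.05
significant = abs(t) > CRITICAL_T_9_DF
print(f"mean = {mean}, S(X - mean)^2 = {sum_sq_dev}, t = {t:.2f}, n = {dof}")
print("significant" if significant else "not significant")
```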

The above method of calculation is not convenient if the
mean is not a whole number. We may, therefore, adopt an
alternative method of calculation, making use of the identity

$$S(X-\bar{X})^2 = S(X^2) - \frac{1}{N}[S(X)]^2.$$

In the above example all that is needed is to construct and
add a column giving the squares of all the observations, i.e. a
column of $X^2$ yielding a total of $S(X^2)$. In Example 12,

$$S(X^2) = 4+1+4+16+16+9+0+16+1+1 = 68.$$

Hence

$$S(X-\bar{X})^2 = 68 - \tfrac{1}{10}(10)^2 = 68 - 10 = 58.$$

From this point the calculation of t is identical with that
given in Example 12.
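For readers who wish to see why the shortened sum of squares is legitimate, the step may be sketched as follows. It uses nothing beyond the definition $\bar{X} = S(X)/N$ and the S( ) notation for a sum over the sample.

```latex
\begin{aligned}
S(X-\bar{X})^2 &= S(X^2) - 2\bar{X}\,S(X) + N\bar{X}^2 \\
               &= S(X^2) - \frac{2[S(X)]^2}{N} + \frac{[S(X)]^2}{N} \\
               &= S(X^2) - \frac{[S(X)]^2}{N}.
\end{aligned}
```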

5.iii. Significance of the difference between means.

An important and often occurring problem is to determine
whether there is a real difference between two observational
means or not. In statistical language this problem may be
expressed in the words: Is the difference between the means such
that they might have been drawn from the same population
by random sampling or are they drawn from two different
populations? There are two methods of dealing with this
question appropriate to the cases in which the samples are
large or in which they are small.

(a) Large samples. If the numbers of observations in the two
samples are large, say, at least 50, the question may be settled
by calculating the standard error of the difference between the
means. If the means are $\bar{X}_1$ and $\bar{X}_2$, their standard deviations
$\sigma_1$ and $\sigma_2$, and the numbers in the samples $N_1$ and $N_2$
respectively, then the standard error of the difference between the
means is given by the formula

$$\text{S.e. of difference} = \sqrt{\frac{\sigma_1^2}{N_1} + \frac{\sigma_2^2}{N_2}}. \qquad (13)$$

This formula applies only if the two variables are independent
or uncorrelated (see Chapter vi).

If the two variables are correlated

$$(\text{S.e. of difference})^2 = \frac{\sigma_1^2}{N_1} + \frac{\sigma_2^2}{N_2} - 2r\,\frac{\sigma_1}{\sqrt{N_1}}\,\frac{\sigma_2}{\sqrt{N_2}},$$

where r is the coefficient of correlation.

As before, the probable error of the difference would be

$$\text{P.e. of difference} = 0.67449\sqrt{\frac{\sigma_1^2}{N_1} + \frac{\sigma_2^2}{N_2}}. \qquad (13\text{A})$$

If the difference between the two means is greater than twice
its standard error (or three times its probable error), then the
means are significantly different, i.e. it is unlikely that they
would be drawn from the same population by random sampling,
the odds against being at least 19 to 1.

Example 13. A group of boys and a group of girls were given
an intelligence test. The mean scores, standard deviations and
numbers in the groups were as follows:

               Boys    Girls
  Mean          124     121
  $\sigma$       12      10
  N              72      50

Was the mean test score of the boys significantly greater than
that of the girls?

Difference between the means = 124 - 121 = 3,

$$\text{S.e. of difference} = \sqrt{\frac{12^2}{72} + \frac{10^2}{50}} = \sqrt{2 + 2} = 2.$$

Hence

$$\frac{\text{Difference}}{\text{S.e. of difference}} = \frac{3}{2} = 1.5.$$

In this experiment, therefore, the mean intelligence test score
of the boys was not significantly greater than that of the girls.
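A minimal sketch of formula (13) applied to Example 13 follows. It assumes the summary figures boys: mean 124, $\sigma$ 12, N 72 and girls: mean 121, $\sigma$ 10, N 50; the function name is hypothetical, and the rule of "twice the standard error" is the large-sample criterion stated in the text.

```python
import math

def se_difference_large(sigma_1, n_1, sigma_2, n_2):
    """Formula (13): standard error of the difference between two
    independent means, for large samples."""
    return math.sqrt(sigma_1 ** 2 / n_1 + sigma_2 ** 2 / n_2)

# Assumed summary figures for Example 13.
se = se_difference_large(12, 72, 10, 50)
ratio = (124 - 121) / se

# Large-sample rule: significant if the difference exceeds twice its s.e.
significant = ratio > 2
print(f"s.e. = {se:.2f}, difference / s.e. = {ratio:.2f}")
print("significant" if significant else "not significant")
```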

(b) Small samples. When we wish to compare the means of
small samples of less than 50 observations, the use of formula
(13) is no longer a sufficiently strict test and we have to apply
the t test of significance. The test is essentially similar to that
in the previous section but corrections have to be made to
allow for sampling errors, which are more important in small
samples.

Suppose we have $N_1$ readings of a variable $X_1$ and $N_2$ readings
of a variable $X_2$ and we wish to see whether or not the means
$\bar{X}_1$ and $\bar{X}_2$ differ significantly from one another (or, in other
words, we wish to see what is the probability that the two
samples could be drawn from the same population). To apply
the t test in this case we need to know six quantities, $N_1$, $N_2$,
$S(X_1)$, $S(X_2)$, $S(X_1^2)$ and $S(X_2^2)$.


As usual, $\bar{X}_1 = S(X_1)/N_1$ and $\bar{X}_2 = S(X_2)/N_2$.

The variance of the combined observations, which we shall
denote by $\sigma_p^2$, is

$$\sigma_p^2 = \frac{S(X_1-\bar{X}_1)^2 + S(X_2-\bar{X}_2)^2}{N_1+N_2-2},$$

and the standard error of the difference is

$$\text{S.e. of difference} = \sigma_p\sqrt{\frac{1}{N_1}+\frac{1}{N_2}}.$$

(The reader should compare this expression with formula (13)
in the case where $\sigma_1 = \sigma_2$.)

In this case t is the ratio of the difference between the means
to the standard error of the difference, i.e.

$$t = \frac{|\bar{X}_1-\bar{X}_2|}{\sigma_p\sqrt{\dfrac{1}{N_1}+\dfrac{1}{N_2}}}. \qquad (14)$$

We must now express $\sigma_p$ in terms of the six quantities with
which we started. We have

$$\sigma_p^2 = \frac{S(X_1^2) - \dfrac{[S(X_1)]^2}{N_1} + S(X_2^2) - \dfrac{[S(X_2)]^2}{N_2}}{N_1+N_2-2}.$$

Written in full, therefore,

$$\text{S.e. of difference} = \sqrt{\frac{1}{N_1+N_2-2}\left(\frac{1}{N_1}+\frac{1}{N_2}\right)} \times \sqrt{S(X_1^2) - \frac{[S(X_1)]^2}{N_1} + S(X_2^2) - \frac{[S(X_2)]^2}{N_2}}.$$

The part dealing with $N_1$ and $N_2$ under the first square-root
sign reduces to

$$\sqrt{\frac{N_1+N_2}{N_1 N_2 (N_1+N_2-2)}},$$

for which we may write $\sqrt{N'}$. For the convenience of students,
values of $\sqrt{1/N'}$ for $N_1$ and $N_2$ between 10 and 50 have been
tabulated in Appendix A.

Using the six quantities with which we started, we
may therefore express t in the following form:

$$t = \frac{|\bar{X}_1-\bar{X}_2|\,\sqrt{1/N'}}{\sqrt{S(X_1^2) - \dfrac{[S(X_1)]^2}{N_1} + S(X_2^2) - \dfrac{[S(X_2)]^2}{N_2}}}. \qquad (14\text{A})$$

The use of this formula is shown in the example below.

In this case, for using Table III to find the probability of
obtaining observed values of t, the value of n is

$$n = N_1 + N_2 - 2.$$

Note. The t test is applicable only when the variates are
normally distributed and are not correlated.




Example 14. The span of apprehension of two small groups
of children, one from the lowest class in a school and the other
from the top class, was tested by seeing how many digits they
could repeat backwards from memory after hearing them once
repeated forwards. The numbers of digits correctly repeated in
the two cases were as follows:

Group A    3   6   5   4   3   3   4

Group B    6   8   9   5  12   9   7   6


Is there any real difference in the span of apprehension of the
two groups?

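We have the following data:

For group A, $N_1 = 7$,             For group B, $N_2 = 8$,
             $S(X_1) = 28$,                      $S(X_2) = 62$,
             $S(X_1^2) = 120$,                   $S(X_2^2) = 516$,
             $\bar{X}_1 = 4.00$.                 $\bar{X}_2 = 7.75$.

Hence $\bar{X}_2 - \bar{X}_1 = 7.75 - 4.00 = 3.75$.

Now

$$\sqrt{\frac{1}{N'}} = \sqrt{\frac{7 \times 8 \times 13}{15}} = 6.97.$$

Also

$$S(X_1^2) - \frac{[S(X_1)]^2}{N_1} + S(X_2^2) - \frac{[S(X_2)]^2}{N_2} = 120 - 112 + 516 - 480.5 = 43.5.$$

Hence, by formula (14A),

$$t = \frac{3.75 \times 6.97}{\sqrt{43.5}} = \frac{26.1375}{6.595} = 3.96.$$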

For using Table III in this case, the value of n for entry is
7 + 8 - 2 = 13. The value of t in Table III corresponding to
n = 13 is 2.160. Our calculated value of t is considerably
greater than this, hence the difference between the means is
significant and we conclude that the span of apprehension of
group B was significantly greater than that of group A.
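Formula (14A) needs only the six quantities named in the text, so the calculation of Example 14 can be sketched directly from them. The function name and return layout below are illustrative, not part of the original method.

```python
import math

def small_sample_t(n1, sum1, sumsq1, n2, sum2, sumsq2):
    """Formula (14A): t for the difference between two small-sample means,
    computed from N, S(X) and S(X^2) of each group."""
    mean1 = sum1 / n1
    mean2 = sum2 / n2
    # Combined sum of squared deviations about the two group means.
    pooled_ss = (sumsq1 - sum1 ** 2 / n1) + (sumsq2 - sum2 ** 2 / n2)
    # The quantity written sqrt(1/N') in the text.
    root_inv_n_prime = math.sqrt(n1 * n2 * (n1 + n2 - 2) / (n1 + n2))
    t = abs(mean1 - mean2) * root_inv_n_prime / math.sqrt(pooled_ss)
    return t, n1 + n2 - 2, pooled_ss, root_inv_n_prime

# Six quantities of Example 14: group A, then group B.
t, dof, pooled_ss, factor = small_sample_t(7, 28, 120, 8, 62, 516)
print(f"sqrt(1/N') = {factor:.2f}, pooled sum of squares = {pooled_ss}")
print(f"t = {t:.2f} with n = {dof}")
```

Working from the totals rather than the individual readings mirrors the hand method and avoids forming any deviations.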



If the actual observations in the two groups for comparison
are large, the arithmetic in calculating the standard error of
the difference between the means may be fairly arduous. In
such cases the work may be lightened by working from arbitrary
origins, provided the range of the readings is not too
great. To do this, we choose a convenient arbitrary origin, or
starting point, near the middle of the range of observations.
Calling this origin A, we construct a column headed $(X-A)$
by subtracting A from each reading in turn. Care must be
taken of the signs since some of the entries will be negative.
Summing this column gives us $S(X-A)$. Each entry in this
column is then squared and the resulting $(X-A)^2$ column
summed to obtain $S(X-A)^2$.

We then make use of the following identities:

$$S(X) = S(X-A) + NA,$$

$$S(X^2) = S(X-A)^2 - NA^2 + 2A\,S(X),$$

$$S(X-\bar{X})^2 = S(X-A)^2 - \frac{[S(X-A)]^2}{N}.$$

The method of extending this process to two groups is
illustrated in the following example.

Example 15. Is the difference between the mean reaction
times of the following two groups significant?

Group A    98   97  104  106  100  111   99   99  101  102

Group B   100   94   93   99  101   87   86   91   85   86   89

The range of observations in group A is from 97 to 111, so we
may choose 105 as a convenient origin, i.e. $A_1 = 105$. Call the
readings in group A, $X_1$. We then construct a column headed
$X_1 - A_1$ and sum it (see table below). Each reading in this
column is then squared, giving a column headed $(X_1-A_1)^2$,
which we also sum. The process is repeated with group B
taking $A_2 = 90$.

Substituting these totals in the identities given above
we have for group A,

$$S(X_1) = S(X_1-A_1) + N_1 A_1 = -33 + (10 \times 105) = 1017,$$

whence $\bar{X}_1 = 1017/10 = 101.7$,

$$S(X_1-\bar{X}_1)^2 = S(X_1-A_1)^2 - \frac{[S(X_1-A_1)]^2}{N_1} = 273 - \frac{33^2}{10} = 273 - 108.9 = 164.1.$$

Similarly for group B,

$$\bar{X}_2 = 91.9091, \qquad S(X_2-\bar{X}_2)^2 = 354.9091.$$





  $X_1-A_1$   $(X_1-A_1)^2$   $X_2-A_2$   $(X_2-A_2)^2$
     -7            49            10           100
     -8            64             4            16
     -1             1             3             9
      1             1             9            81
     -5            25            11           121
      6            36            -3             9
     -6            36            -4            16
     -6            36             1             1
     -4            16            -5            25
     -3             9            -4            16
                                 -1             1
    ---           ---           ---           ---
    -33           273            21           395


From Appendix A, for $N_1 = 10$ and $N_2 = 11$, we find that
$\sqrt{1/N'} = 9.98$. Substituting in formula (14A) we obtain

$$t = \frac{(101.7 - 91.9091) \times 9.98}{\sqrt{164.1 + 354.9091}} = \frac{97.7132}{22.78} = 4.29.$$

From Table III with $n = N_1 + N_2 - 2 = 19$, we find a critical
value for t of 2.093. Our calculated value is much larger than
this, hence the difference between the means of the two groups
may be taken as significant.
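The arbitrary-origin method of Example 15 can be sketched as below. The readings are entered with the ninth value of group B taken as 85, which is what the tabulated deviations and their totals imply; the origins 105 and 90 and the helper name are otherwise illustrative choices.

```python
import math

group_a = [98, 97, 104, 106, 100, 111, 99, 99, 101, 102]
group_b = [100, 94, 93, 99, 101, 87, 86, 91, 85, 86, 89]

def mean_and_ss_from_origin(readings, origin):
    """Mean and sum of squared deviations about the mean, obtained from
    deviations about an arbitrary origin A."""
    n = len(readings)
    deviations = [x - origin for x in readings]
    sum_dev = sum(deviations)                    # S(X - A)
    sum_dev_sq = sum(d * d for d in deviations)  # S(X - A)^2
    mean = origin + sum_dev / n
    ss = sum_dev_sq - sum_dev ** 2 / n           # S(X - mean)^2
    return mean, ss, sum_dev, sum_dev_sq

mean_a, ss_a, sum_dev_a, sum_dev_sq_a = mean_and_ss_from_origin(group_a, 105)
mean_b, ss_b, sum_dev_b, sum_dev_sq_b = mean_and_ss_from_origin(group_b, 90)

n_a, n_b = len(group_a), len(group_b)
factor = math.sqrt(n_a * n_b * (n_a + n_b - 2) / (n_a + n_b))  # sqrt(1/N')
t = abs(mean_a - mean_b) * factor / math.sqrt(ss_a + ss_b)
print(f"group A: mean = {mean_a:.1f}, sum of squares = {ss_a:.1f}")
print(f"group B: mean = {mean_b:.4f}, sum of squares = {ss_b:.4f}")
print(f"t = {t:.2f} with n = {n_a + n_b - 2}")
```

Any other pair of origins would give the same means and sums of squares; the choice only affects the size of the numbers handled.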

5.iv. Variance differences: the variance ratio. A point
that is frequently lost sight of is that when two means are
judged to be significantly different by the t test, this result
may be due not to the fact that the two groups of observations
are drawn from populations with different means, but to the
fact that the variances of the two groups are significantly
different. This may, and should, be examined by calculating
the variance ratio of the two groups and testing its significance
by reference to Fisher and Yates’s Table V. The variance
ratio (v.r.) is simply obtained by working out the variance of
each group and dividing the larger by the smaller. The
number of degrees of freedom available for this is N - 1 for
each group, but note that in referring to Fisher and Yates’s
table, $n_1$ is the number of degrees of freedom in the larger
variance.

Applying this to Example 15 of the previous section,
we have for the variance of group A, 164.1/9 = 18.23 and for
the variance of group B, 354.9091/10 = 35.49.

Hence the v.r. = 35.49/18.23 = 1.95.

Referring to Fisher and Yates’s table, we find that for
$n_1 = 10$ (since the variance of group B is the larger) and
$n_2 = 9$, the critical value of the v.r. is about 3.1 (its exact
value could be found by interpolation). The calculated v.r.
is much smaller than this, so we conclude that the two
variances are not significantly different.
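The variance ratio for Example 15 may be sketched as follows. The function takes the sums of squared deviations and group sizes as inputs and is an illustrative helper only; the critical value must still be looked up in Fisher and Yates's table.

```python
def variance_ratio(ss_1, n_1, ss_2, n_2):
    """Larger variance divided by smaller, with the degrees of freedom
    of the larger variance returned first."""
    var_1 = ss_1 / (n_1 - 1)
    var_2 = ss_2 / (n_2 - 1)
    if var_1 >= var_2:
        return var_1 / var_2, n_1 - 1, n_2 - 1
    return var_2 / var_1, n_2 - 1, n_1 - 1

# Sums of squares and group sizes from Example 15 (group A, group B).
vr, dof_larger, dof_smaller = variance_ratio(164.1, 10, 354.9091, 11)
print(f"v.r. = {vr:.2f}, n1 = {dof_larger}, n2 = {dof_smaller}")
```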

5.v. Significance of difference between proportions.

It frequently happens that no exact measurement of a quality
is possible but the presence or absence of the quality may be
observed. For example, we may note whether or not a person
has blue eyes without attempting to measure intensity of
blueness. In the same way we may note what proportion of
a group of objects have a certain quality and we may wish on
occasion to examine whether the proportion of one group
possessing a quality is really different from the proportion of
another group possessing that same quality. As in the case of
the difference between means, the problem may be expressed
in the form: What are the odds against obtaining by chance a
difference of proportion as big as the one observed in a
homogeneous population? Now in the case of investigating the
significance of the difference between means, the probability
of obtaining differences of various sizes depends on the
distribution of the variables, assumed in this instance to be a
normal distribution. In the case of proportions the
probabilities depend upon the binomial distribution (see Chapter iv).

The parameters of this type of distribution are different
from those of a normal distribution but the method of
examining the significance of differences of proportions is
similar.

Let N be the number of individuals in a group and p be
the proportion of them possessing some particular quality.
Usually p is expressed as a decimal fraction, and we call the
proportion not possessing the quality q, so that q = 1 - p. The
standard error of a single proportion is $\sqrt{pq/N}$. If we have two
groups of $N_1$ and $N_2$ individuals respectively and if the
proportions of them possessing the same quality are $p_1$ and $p_2$,
we need an expression giving us the standard error of the
difference between $p_1$ and $p_2$.

First we have to estimate the proportion of individuals in
the combined groups who possess the particular quality. If
p is this proportion, then

$$p = \frac{p_1 N_1 + p_2 N_2}{N_1 + N_2}.$$

As before, q = 1 - p. We have, then,

$$\text{Standard error of difference between proportions} = \sqrt{pq\left(\frac{1}{N_1}+\frac{1}{N_2}\right)}. \qquad (15)$$

The differences between proportions drawn from a
homogeneous population are not distributed in the same way as
the differences between means drawn from a normal
population. It is therefore safer to require an observed difference
between proportions to be three times its standard error before
assuming significance.

Note. In examining the significance of the difference
between proportions, the actually observed proportions must
be used. If the proportions are expressed as percentages, the
groups concerned are assumed to be of equal size, and this will
lead to error except when the groups are actually equal.


Example 16. In a group of 50 boys, 24 are over 4 ft. in height,
whilst 18 out of a group of 60 girls are over 4 ft. Can the
difference between the proportions of taller children in the
two groups be regarded as real?

We have

$N_1 = 50$, $p_1 = 24/50 = 0.48$,

$N_2 = 60$, $p_2 = 18/60 = 0.30$.

The combined proportion,

$$p = \frac{(0.48 \times 50) + (0.30 \times 60)}{50 + 60} = \frac{24 + 18}{110} = 0.38,$$

$$q = 1 - 0.38 = 0.62.$$

Hence s.e. of difference of proportions

$$= \sqrt{0.38 \times 0.62\left(\frac{1}{50}+\frac{1}{60}\right)} = \sqrt{0.2356 \times 0.03667} = \sqrt{0.008640} = 0.093.$$

The difference

$$p_1 - p_2 = 0.48 - 0.30 = 0.18.$$

The difference is not quite twice its standard error and we
therefore conclude that the difference between the proportions
of taller children in the two groups cannot be regarded as
a real one.
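Formula (15) applied to Example 16 can be sketched as below. The function works from the observed counts, in keeping with the Note above, and its name is illustrative; because it does not round p to two decimals, its standard error may differ from the hand working in the fourth decimal place.

```python
import math

def se_difference_of_proportions(count_1, n_1, count_2, n_2):
    """Formula (15): standard error of the difference between two
    proportions, using the combined proportion p and q = 1 - p."""
    p = (count_1 + count_2) / (n_1 + n_2)
    q = 1.0 - p
    se = math.sqrt(p * q * (1.0 / n_1 + 1.0 / n_2))
    return p, se

# Example 16: 24 of 50 boys and 18 of 60 girls are over 4 ft.
p, se = se_difference_of_proportions(24, 50, 18, 60)
difference = 24 / 50 - 18 / 60
ratio = difference / se
print(f"p = {p:.3f}, s.e. = {se:.3f}, difference / s.e. = {ratio:.2f}")
```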


EXERCISES ON CHAPTER V 

15. Subtract the score in test D from the score in test C for each
of the first 25 subjects. Calculate the mean difference between these
scores and determine whether this mean is significantly different from
zero or not. Use formula (12).

16. Using formula (13) and the standard deviations given in the
answers to Exercise 10, calculate the standard error of the difference
between the mean scores of the whole 100 subjects in the following
tests:

(a) A and G, (b) C and D, (c) C and F,

(d) D and F, (e) D and G, (f) F and G.

Thence determine which pairs of means are significantly different.
Use the means given in the answer to Exercise 3 (a).



17. Using the method of Section 5.iii (b), formula (14A), examine
the significance of the difference between the following pairs of means:

(a) Test A: mean of subjects 1-25 and 26-50.

(b) Test C: mean of subjects 1-25 and 26-50.

(c) Test D: mean of subjects 1-25 and 26-50.

(d) Test G: mean of subjects 1-25 and 26-50.

(e) Mean of subjects 1-25 in test C and mean of subjects 1-25 in
test G.

(f) Mean of subjects 51-75 in test C and mean of subjects 51-75
in test G.

(g) Mean of subjects 51-75 in test A and mean of subjects 51-75
in test D.

(h) Mean of subjects 76-100 in test F and mean of subjects 76-100
in test G.

18. Calculate the variance ratio between the two groups in Example 
14. Does this invalidate the conclusion that the means of the two groups 
are significantly different? 

19. In a school of 320 boys in a certain town, 45% were absent
through influenza in one month. In the same month in another town, 
39 boys from a school of 150 were absent for the same reason. From 
these samples, would you regard the prevalence of influenza as being 
equal in the two towns ? 





Chapter VI

CORRELATION 

6.i. It frequently happens in experimental work that we
wish to know the association between two variables, that is,
to know to what extent one variable is related to the other.
There are various methods of measuring such association,
dependent on the nature of the variables, their types of
distribution, etc. When the two variables are numerical and
normally distributed, the association, or correlation, between
them may best be measured by a method known as the
product-moment method. This is by far the most useful and
theoretically satisfactory method of measuring correlation
and much advanced statistical work is based upon product-
moment correlation. It will therefore be considered first.

6.ii. In describing methods of correlation the two variables
will be called X and Y. These variables will have means $\bar{X}$
and $\bar{Y}$ and standard deviations $\sigma_x$ and $\sigma_y$. Since the various
values of the variables will always be considered in pairs, an
X with a Y, there will be the same number of X’s and Y’s
in any particular case, i.e. N. In terms of these statistics a
quantity known as the coefficient of correlation may be
calculated: this coefficient is denoted by r. If there is complete
positive correlation between X and Y, r has the value 1; if
there is complete negative correlation it has the value -1, and
incomplete correlation gives decimal values for r between 1 and
-1. If there is no relation at all between the variables, r is 0.

The meaning of the above statements may be illustrated in
this way. Suppose the heights and weights of N people were
measured: these would be our X and Y and there would be one
value of X and one value of Y relating to each person. Now
suppose the tallest person was also the heaviest, the second
tallest the second heaviest and so on, until we reached the
shortest person who was also the lightest: in this case there
would be complete positive association between X and Y and
r would be 1 (provided the relation between X and Y was
exactly linear and could be expressed by the equation
Y = a + bX). If, on the other hand, the group of persons was
so peculiar that the tallest person was also the lightest, the
second tallest the second lightest and so on to the shortest,
who would be the heaviest in this case, then there would be
complete negative correlation between height and weight, and
r would be -1 (with the same proviso as before). Complete
correlation is very rare. Usually there is a general but not
complete agreement between two variables, so that r is
fractional.


6.iii. The general formula for r is quite simple. If $(X-\bar{X})$
and $(Y-\bar{Y})$ are the deviations of corresponding values of X and
Y from their means (i.e. the pair of values corresponding to one
person), then these two deviations may be multiplied together
to give the product $(X-\bar{X})(Y-\bar{Y})$. If we add together all
such products for the N persons, we obtain $S(X-\bar{X})(Y-\bar{Y})$.
The coefficient of correlation is then given by the formula

$$r = \frac{S(X-\bar{X})(Y-\bar{Y})}{N\sigma_x\sigma_y}. \qquad (16)$$

The application of the name ‘product-moment’ to this method
may now be appreciated. The average deviation of any value
of X from the mean may be described as the first moment about
the mean, and similarly for the Y’s. The mean product of such
deviations is similarly called a ‘product-moment’. The
expression $S(X-\bar{X})(Y-\bar{Y})/N$ is called the co-variance.

Crude test scores may be transformed into standard scores
by expressing them as deviations from their mean in terms of
their standard deviation. Thus a crude score of X yields a
standard score of $(X-\bar{X})/\sigma_x$ and a crude score of Y yields a
standard score of $(Y-\bar{Y})/\sigma_y$. Call these $X_s$ and $Y_s$
respectively. Then

$$r = \frac{S(X_s Y_s)}{N}.$$



52 


COBBEIiATION 


Fonnula (16) is the simple theoretical form. In prabti^oe con¬ 
siderations of ease in calculation necessitate modifications of 
this and we shall now consider some of them. 


6.iv. Product-moment correlation when N is small.

If we have a small number of X’s and Y’s, say less than 80, the
calculation of r would be performed as follows. Write down the
values of X and Y in two parallel columns in their pairs, i.e. so
that the pair of readings in each horizontal row belongs to the
same person. Next calculate two more columns, headed $X^2$
and $Y^2$, by squaring the terms in the first two columns. The
totals of these four columns will give us $S(X)$, $S(Y)$, $S(X^2)$ and
$S(Y^2)$, and these data will enable us to calculate the means
and standard deviations of X and Y by formulae (1) and
(3B).

Now instead of subtracting the mean $\bar{X}$ from each value of
X and $\bar{Y}$ from each value of Y and multiplying the answers
together, we shall get the total of products, or product-sum,
by a method similar to that used in Section 3.vii for calculating
the standard deviation. We shall multiply together the actual
values of X and Y as they stand in columns 1 and 2, sum the
products, and then subtract the product of $\bar{X}$ and $\bar{Y}$ at the
end, after having divided the product-sum by N. The validity
of this may readily be seen by the following simple algebraical
proof:

$$(X-\bar{X})(Y-\bar{Y}) = XY - \bar{X}Y - \bar{Y}X + \bar{X}\bar{Y}.$$

$$S(X-\bar{X})(Y-\bar{Y}) = S(XY) - \bar{X}S(Y) - \bar{Y}S(X) + N\bar{X}\bar{Y}.$$

$$\frac{S(X-\bar{X})(Y-\bar{Y})}{N} = \frac{S(XY)}{N} - \bar{X}\frac{S(Y)}{N} - \bar{Y}\frac{S(X)}{N} + \bar{X}\bar{Y}$$

$$= \frac{S(XY)}{N} - \bar{X}\bar{Y} - \bar{Y}\bar{X} + \bar{X}\bar{Y}$$

$$= \frac{S(XY)}{N} - \bar{X}\bar{Y}.$$

Accordingly we form a fifth column, headed XY, by
multiplying together corresponding X’s and Y’s in the first two
columns. The total of this column is $S(XY)$. We may then
obtain r from the formula

$$r = \frac{\dfrac{S(XY)}{N} - \bar{X}\bar{Y}}{\sigma_x\sigma_y}. \qquad (16\text{A})$$


Example 17. Twenty pupils are given small tests in arithmetic
and Latin, and the marks gained in each test, from a
maximum of 10, are shown below.

  Pupil     A   B   C   D   E   F   G   H   I   J
  Arith.    3   9   7   8   4   1   6   9   7   8
  Latin     1   8   4  10   6   5   5   3   8   7

  Pupil     K   L   M   N   O   P   Q   R   S   T
  Arith.    5   4   6   5   2   6   5   4   6   2
  Latin     2   6   5   9   4   5   7   1   3   5

Calculate the coefficient of correlation between the two sets of
marks.

We construct the five columns, as described above, and
obtain $S(X)$, $S(Y)$, $S(X^2)$, $S(Y^2)$ and $S(XY)$. The actual
arithmetic is shown.



     X      Y     X²     Y²     XY
     3      1      9      1      3
     9      8     81     64     72
     7      4     49     16     28
     8     10     64    100     80
     4      6     16     36     24
     1      5      1     25      5
     6      5     36     25     30
     9      3     81      9     27
     7      8     49     64     56
     8      7     64     49     56
     5      2     25      4     10
     4      6     16     36     24
     6      5     36     25     30
     5      9     25     81     45
     2      4      4     16      8
     6      5     36     25     30
     5      7     25     49     35
     4      1     16      1      4
     6      3     36      9     18
     2      5      4     25     10
   ---    ---    ---    ---    ---
   107    104    673    660    595

$$\bar{X} = \frac{107}{20} = 5.35, \qquad \bar{Y} = \frac{104}{20} = 5.20.$$

By formula (3B),

$$\sigma_x = \sqrt{\frac{673}{20} - 5.35^2} = 2.242, \qquad \sigma_y = \sqrt{\frac{660}{20} - 5.20^2} = 2.441,$$

$$\frac{S(XY)}{N} = \frac{595}{20} = 29.75.$$

Hence,

$$r = \frac{29.75 - 5.35 \times 5.20}{2.242 \times 2.441} = 0.353.$$
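The five-column calculation of Example 17 can be sketched as follows. The marks are entered as they appear in the working columns (X for arithmetic, Y for Latin), the standard deviations use the divisor N as in the hand working, and the function name is illustrative.

```python
import math

arithmetic = [3, 9, 7, 8, 4, 1, 6, 9, 7, 8, 5, 4, 6, 5, 2, 6, 5, 4, 6, 2]
latin = [1, 8, 4, 10, 6, 5, 5, 3, 8, 7, 2, 6, 5, 9, 4, 5, 7, 1, 3, 5]

def product_moment_r(xs, ys):
    """Formula (16A): r from the totals S(X), S(Y), S(X^2), S(Y^2), S(XY)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sigma_x = math.sqrt(sum(x * x for x in xs) / n - mean_x ** 2)
    sigma_y = math.sqrt(sum(y * y for y in ys) / n - mean_y ** 2)
    covariance = sum(x * y for x, y in zip(xs, ys)) / n - mean_x * mean_y
    return covariance / (sigma_x * sigma_y)

r = product_moment_r(arithmetic, latin)
print(f"S(X) = {sum(arithmetic)}, S(Y) = {sum(latin)}, r = {r:.3f}")
```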



54 


CORRELATION 


6.V. If the actual values of X and Y are large, a ^o(jd deal 
of arithmetic would be needed to obtain the third, fc^rth and 
fifth columns in the above method. Provided the ranges of the 
X's and P’s are fairly small, a modification of the method may 
be made by writing down the values of X and Y as they 
deviate from suitable arbitrary origins. Such arbitrary origins 
would be chosen near the middle of the range of each variable 
and the deviations of each X and Y from these would have to 
be written with due regard to sign. For example, if the values 
of X varied from 61 to 78 we might take 70 as arbitrary origin. 
In this case a reading of 61 would be written as — 9, i.e. 61 — 70. 
In the same way, 68 would become —2, 78 would be 8 and 
70 would be 0. 

In this manner we could replace our original X and Y 
columns by two other columns recording the deviations of the 
X’s and F’s from their arbitrary origins. Call these X' and 
Y\ From these the columns of squares and products may be 
obtained as before. In employing this method it is advan¬ 
tageous to set out the plus and minus values of X' and F' in 
separate columns, as shown in Example 18. 

In this case $S(X')/N$ would give the difference between the
arbitrary origin and the true mean of X, so that using the
notation of Section 2.iii we may write $S(X')/N = D_x$, and
$S(Y')/N = D_y$. Similarly, from Section 3.viii, with
modifications, we get

$$\sigma_x = \sqrt{\frac{S(X'^2)}{N} - D_x^2}, \qquad \sigma_y = \sqrt{\frac{S(Y'^2)}{N} - D_y^2}.$$

Formula (16A) is accordingly modified to the following in this
case:

$$r = \frac{\dfrac{S(X'Y')}{N} - D_x D_y}{\sigma_x\sigma_y}. \qquad (16\text{B})$$

The calculation of r by this method is exemplified below.

Example 18. Find the correlation between the two test
scores given below for 10 subjects.

  Subject     1    2    3    4    5    6    7    8    9   10
  Test A     19   25   17   20   26   30   29   21   23   24
  Test B    145  151  140  144  138  140  142  150  149  150

Test A has a range from 17 to 30, so that a convenient
arbitrary origin will be 25. For test B a convenient origin will
be 145, as the range is from 138 to 151.



        X'           Y'        X'²    Y'²       X'Y'
     +     -      +     -                      +     -
          -6      0            36      0       0
     0            6             0     36       0
          -8           -5      64     25      40
          -5           -1      25      1       5
     1                 -7       1     49            -7
     5                 -5      25     25           -25
     4                 -3      16      9           -12
          -4      5            16     25           -20
          -2      4             4     16            -8
          -1      5             1     25            -5
    --    --     --    --     ---    ---      --    --
    10   -26     20   -21     188    211      45   -77
       -16          -1                           -32

$$D_x = -16/10 = -1.6, \qquad D_y = -1/10 = -0.1,$$

$$\sigma_x = \sqrt{188/10 - 1.6^2} = 4.030, \qquad \sigma_y = \sqrt{211/10 - 0.1^2} = 4.592.$$

Hence

$$r = \frac{-32/10 - (-1.6 \times -0.1)}{4.030 \times 4.592} = -0.182.$$
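Formula (16B) can be sketched as below for the data of Example 18. The scores are those implied by the deviation columns of the working table (origins 25 and 145), and the function name is illustrative. The second call, with both origins at zero, shows that the choice of origin does not change r.

```python
import math

test_a = [19, 25, 17, 20, 26, 30, 29, 21, 23, 24]
test_b = [145, 151, 140, 144, 138, 140, 142, 150, 149, 150]

def r_from_origins(xs, ys, origin_x, origin_y):
    """Formula (16B): r computed from deviations X', Y' about arbitrary origins."""
    n = len(xs)
    dx = [x - origin_x for x in xs]  # X' column
    dy = [y - origin_y for y in ys]  # Y' column
    d_x = sum(dx) / n                # D_x
    d_y = sum(dy) / n                # D_y
    sigma_x = math.sqrt(sum(v * v for v in dx) / n - d_x ** 2)
    sigma_y = math.sqrt(sum(v * v for v in dy) / n - d_y ** 2)
    product_sum = sum(a * b for a, b in zip(dx, dy))  # S(X'Y')
    r = (product_sum / n - d_x * d_y) / (sigma_x * sigma_y)
    return r, d_x, d_y, sigma_x, sigma_y

r, d_x, d_y, sigma_x, sigma_y = r_from_origins(test_a, test_b, 25, 145)
r_other, *_ = r_from_origins(test_a, test_b, 0, 0)
print(f"D_x = {d_x}, D_y = {d_y}, sigma_x = {sigma_x:.3f}, sigma_y = {sigma_y:.3f}")
print(f"r = {r:.3f} (origins 25, 145); r = {r_other:.3f} (origins 0, 0)")
```

Both calls give a value of about -0.18, a weak negative correlation.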


6.vi. Product-moment correlation when N is large.

If N is large, over 80, the foregoing methods of calculating r
become very laborious and it is usual to curtail the arithmetic
involved by the use of a tabular method. For this purpose we
construct what is known as a correlation table. The
construction of such a table is most easily understood by reference to
an example such as is given in Table IV, which illustrates the
correlation between two tests, X and Y.

In this table each of the test scores has been grouped for
working purposes (see Section 2.iii). For test X the convenient
group unit was 2 and accordingly the X groupings are written
along the top of the table. Test Y had a larger range and the
group unit chosen was 10; the Y groupings are written at the
left-hand side of the table.





We next proceed to make a spot diagram showing the distribution
of the pairs of scores. For example, one subject scored
0 in test X and 55 in test Y; accordingly a dot is put in the
square, or cell, with the X grouping 0, 1 above it and the Y
grouping 50-59 on the left of it. Similarly a dot is made in the
appropriate cell for each other pair of scores. Hence each dot
represents a score both in test X and in test Y and the total
number of dots will be N, the number of subjects who took the
tests. The number of dots in each cell is counted and the
number written in the cell. (If the data are on cards, the cards
may be sorted into their appropriate groups and counted, thus
obviating the necessity of making a spot diagram first.)

The total number of observations in each horizontal row, or
array, is then found and recorded on the right of the table.
These will form a column headed $f_y$, which will give the grouped
frequency distribution of Y. Similarly, at the foot of each
column of the table the total number of observations is recorded,
giving a horizontal row for $f_x$, or the grouped frequency
distribution of X. The totals of this last column and row, i.e.
$S(f_y)$ and $S(f_x)$, should be the same and equal to N.

We then choose an arbitrary origin for each test, call the
corresponding group 0, and then number the groups on both
sides as we did in Section 2.iii. This will give us a column on the
right headed 'y' and a row at the top of the table which will
be 'x'. The means and standard deviations of both X and Y
in working units may now be calculated, using the method of
Section 3.viii. This is most conveniently done at the side of the
table (see Example 19). There is no need to obtain the true
means and standard deviations of the tests; indeed it is important
to keep all the calculations in working units throughout.
Hence we calculate $D_x$ and $D_y$, using the notation of
Section 2.iii, and $\sigma_x$ and $\sigma_y$, using the notation of formula
(7A).

There now remains the problem of finding the sum of the
xy products. Reference to Table IV will show how this is done.

Starting with the top horizontal array, write in each cell the
product of the frequency in that cell and the x-grouping above
it. For instance, the frequency in the first cell is 1 (written in
the bottom right-hand corner) and the x-grouping for that cell
is -6; hence we write -6 in the top right-hand corner of the
cell. Continue this for each cell containing observations. Then
add these small top right-hand corner numbers for each horizontal
array and record the totals in a column on the right of
the table. Head this column $T_x$, indicating the total of the
x's for each y. As a check on the arithmetic so far, sum this
column; its total, $S(T_x)$, should be equal to $S(f_x x)$,
as was found in calculating the mean of x.

Finally, we make one further column on the right of the
table by multiplying together the entries in the y and $T_x$
columns: this new column will be headed $yT_x$. Sum this
column to obtain $S(yT_x)$. This gives us the sum of the xy
products. Care should be taken with the signs in all these
calculations.

[Table IV. Correlation table for tests X and Y, with the
working-unit groupings x = -6 to 5 along the top and y = -2 to 5
down the side, and the columns $f_y$, $T_x$ and $yT_x$ on the
right. The cell entries are not legible.]



We have now all the necessary data and the correlation
coefficient may be found by substitution in the modified
formula:

$r = \dfrac{S(yT_x)/N - D_x D_y}{\sigma_x \sigma_y}$.    (16C)

The whole of the arithmetic involved is shown in the following
example.


Example 19. Calculate the coefficient of correlation between
the tests X and Y in Table IV. (See p. 59 for working.)


6.vii. The diagonal summation method. There is an
alternative method, which is preferred by some computers, of
calculating r from a correlation table of grouped data. The
correlation table is constructed as in Example 19 but the small
product figures in the top right-hand corners of the cells are
omitted. As in that example the columns for $f_y$, y, $f_y y$ and
$f_y y^2$ are written, and also those for $f_x$, x, $f_x x$ and
$f_x x^2$. From these are calculated two quantities which we
shall denote by A and B. These are obtained as follows:

$A = S(f_x x^2) - \dfrac{[S(f_x x)]^2}{N}$,    $B = S(f_y y^2) - \dfrac{[S(f_y y)]^2}{N}$.

We now construct a further column by summing the total
frequencies in each diagonal of the table. In order to obtain the
correct sign for r, it is important that these diagonals shall run
from the corner of the table where both variables have low
scores to the corner where both have high scores; in Example
19, for instance, the diagonals will all be parallel to that
running from the top left-hand corner to the bottom right-hand
corner. This column we head $f_d$.

We then proceed as though we were finding the standard
deviation of this column, i.e. an arbitrary origin is chosen and
numbered 0 and the diagonals numbered positively and
negatively on the two sides of this origin, yielding a column
headed d. Two further columns, headed $f_d d$ and $f_d d^2$, are
obtained in the usual way and summed. From these totals we
then obtain a third quantity which we shall denote by C. This
is given by

$C = S(f_d d^2) - \dfrac{[S(f_d d)]^2}{N}$.

The coefficient of correlation may then be obtained from
the formula

$r = \dfrac{A + B - C}{2\sqrt{AB}}$.    (16D)

(It is left to the student to relate this formula to formula (16).
He should have no difficulty in doing this if he bears in mind
that if X and Y represent deviations of the two variables from
their means, then

$S(X - Y)^2 = S(X^2) + S(Y^2) - 2S(XY)$;

and that (X - Y) is constant for any diagonal.)

The whole of the arithmetic involved is shown in the
following example.

Example 20. Calculate the coefficient of correlation of the
data in Table IV using the method of diagonal summation.

Since in this case the $f_d$ column will have to be written on the
right of the table, it is convenient to write the $f_y$ and associated
columns on the left and the $f_x$ and associated columns above
the table. The setting out of the working is shown on p. 60.

[Working for Example 20 (p. 60): Table IV with the rows x,
$f_x x$ and $f_x x^2$ written above it. The figures are not legible.]
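The two tabular methods should always agree, and this can be confirmed on a small case. The sketch below is in Python and uses a small hypothetical correlation table (not Table IV); the function names are illustrative. It computes r once from the row totals of the x's, in the manner of formula (16C), and once from the diagonal frequencies, in the manner of formula (16D), taking d = x - y as the number of each diagonal.

```python
from math import sqrt

def marginals(table, xs, ys):
    """Return N, S(f x), S(f x^2), S(f y), S(f y^2) for a correlation table."""
    n = sum(sum(row) for row in table)
    f_x = [sum(row[j] for row in table) for j in range(len(xs))]
    f_y = [sum(row) for row in table]
    sx = sum(f * x for f, x in zip(f_x, xs))
    sxx = sum(f * x * x for f, x in zip(f_x, xs))
    sy = sum(f * y for f, y in zip(f_y, ys))
    syy = sum(f * y * y for f, y in zip(f_y, ys))
    return n, sx, sxx, sy, syy

def r_by_row_totals(table, xs, ys):
    """Row-total method: sum of y times the total of the x's in each row."""
    n, sx, sxx, sy, syy = marginals(table, xs, ys)
    d_x, d_y = sx / n, sy / n
    sigma_x = sqrt(sxx / n - d_x ** 2)
    sigma_y = sqrt(syy / n - d_y ** 2)
    t_x = [sum(f * x for f, x in zip(row, xs)) for row in table]
    s_y_tx = sum(y * t for y, t in zip(ys, t_x))
    return (s_y_tx / n - d_x * d_y) / (sigma_x * sigma_y)

def r_by_diagonals(table, xs, ys):
    """Diagonal method: frequencies summed along diagonals of constant x - y."""
    n, sx, sxx, sy, syy = marginals(table, xs, ys)
    a = sxx - sx ** 2 / n
    b = syy - sy ** 2 / n
    f_d = {}
    for y, row in zip(ys, table):
        for x, f in zip(xs, row):
            f_d[x - y] = f_d.get(x - y, 0) + f
    s_d = sum(f * d for d, f in f_d.items())
    s_dd = sum(f * d * d for d, f in f_d.items())
    c = s_dd - s_d ** 2 / n
    return (a + b - c) / (2 * sqrt(a * b))

# Hypothetical table: rows are y = -1, 0, 1 from the top; columns are x = -1, 0, 1, 2.
xs = [-1, 0, 1, 2]
ys = [-1, 0, 1]
table = [[3, 2, 1, 0],
         [1, 4, 3, 1],
         [0, 1, 2, 4]]

r_rows = r_by_row_totals(table, xs, ys)
r_diag = r_by_diagonals(table, xs, ys)
print(round(r_rows, 4), round(r_diag, 4))
```

Because low x and low y share the top left-hand corner here, the diagonals of constant x - y run from that corner to the bottom right-hand one, as the method requires.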


6.viii. Significance of product-moment correlation:
standard error of r. The standard error of r is usually
calculated from the formula

S.e. of r = $\dfrac{1 - r^2}{\sqrt{N}}$.    (17)

As usual, the probable error is 0.67449 times the standard
error.

This formula is approximately true when N is large and the
values of r are small or moderate in size. In such cases the
correlation may be taken as differing significantly from zero if
r is more than twice its standard error, or more than three times
its probable error.

In small samples, however, or with very large values of r, the
above formula is not true and the significance of r should be
assessed by the t method. For correlation coefficients

$t = \dfrac{r\sqrt{N - 2}}{\sqrt{1 - r^2}}$.    (18)

It is unnecessary to calculate t for each value of r that is
found. Appendix B gives graphically the criterion of significance
for r for values of N from 60 to 270. The significance of
r when N is 60 or less may be found from the following table,
extracted from R. A. Fisher (Ref. 2), Table V A, or Fisher and
Yates (Ref. 6), Table VI.



Table V. Values of r for P = 0.05
n = N - 2

  n      r          n      r
  1    0.997       14    0.497
  2    0.950       15    0.482
  3    0.878       16    0.468
  4    0.811       17    0.456
  5    0.755       18    0.444
  6    0.707       19    0.433
  7    0.666       20    0.423
  8    0.632       25    0.381
  9    0.602       30    0.349
 10    0.576       35    0.325
 11    0.553       40    0.304
 12    0.532       45    0.288
 13    0.514       50    0.273


In using the above table n is 2 less than N, the number of 
pairs of observations in the correlation. If the calculated value 
of r is as big as or bigger than the value given in the table for 
the appropriate value of n, the correlation differs significantly 
from zero, i.e. it indicates a real degree of association between 
the two variables. 

The use of these various assessments of significance is 
illustrated in the following example. 


Example 21. The scores in two tests on 100 subjects are
correlated and the value of r obtained is 0.35. Is the correlation
significant?

(a) By formula (17):

S.e. of r = $\dfrac{1 - 0.1225}{10}$ = 0.08775.

Hence r is significant, since 0.35 is just about 4 times its
standard error.

(b) By formula (18):

$t = \dfrac{0.35\sqrt{98}}{\sqrt{1 - 0.1225}} = 3.70$.

This value of t is much larger than that given in Table III,
p. 37, for n = 30, so that it must be even larger than that for
n = 98; hence r is significantly larger than zero.

(c) By consulting the graph in Appendix B it will be seen
that for N = 100 a value of r of 0.193 is significant. Hence our
present value of 0.35 is definitely significant.
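Parts (a) and (b) of this example are easily reproduced. The following is a minimal Python sketch of the standard error and the t statistic, assuming the usual forms of formulas (17) and (18), namely (1 - r^2)/sqrt(N) and r sqrt(N - 2)/sqrt(1 - r^2); the function names are illustrative.

```python
from math import sqrt

def standard_error_of_r(r, n):
    """Large-sample standard error of r: (1 - r^2) / sqrt(N)."""
    return (1 - r * r) / sqrt(n)

def t_for_r(r, n):
    """t for testing r against zero, referred to N - 2 degrees of freedom."""
    return r * sqrt(n - 2) / sqrt(1 - r * r)

r, n = 0.35, 100
se = standard_error_of_r(r, n)
t = t_for_r(r, n)
print(f"s.e. = {se:.5f}, r / s.e. = {r / se:.2f}, t = {t:.2f}")
```

The ratio of r to its standard error comes out at very nearly 4, in agreement with part (a).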

6.ix. Significance of the difference between two correlations.
We sometimes wish to know whether the correlation
between two variables is different in two different samples. To
do this we make use of a method devised by Fisher (Ref. 2)
which entails transforming r into a quantity which he calls z.
This is given by the formula

$z = \tfrac{1}{2}\{\log_e(1 + r) - \log_e(1 - r)\}$.

Once again there is no need to make the actual calculation as
Table VI gives values of z corresponding to values of r up to
0.500, and vice versa. For values outside the range of this
table the student is referred to Table V B in Fisher (Ref. 2), or
Table VII in Fisher and Yates (Ref. 5).


Table VI. Conversion of r into z and z into r

        For z                      For r
      r           add            z          subtract
 0.000-0.114     0.000      0.000-0.114      0.000
 0.115-0.163     0.001      0.115-0.165      0.001
 0.164-0.194     0.002      0.166-0.196      0.002
 0.195-0.216     0.003      0.197-0.220      0.003
 0.217-0.235     0.004      0.221-0.240      0.004
 0.236-0.251     0.005      0.241-0.256      0.005
 0.252-0.265     0.006      0.257-0.271      0.006
 0.266-0.277     0.007      0.272-0.285      0.007
 0.278-0.288     0.008      0.286-0.297      0.008
 0.289-0.299     0.009      0.298-0.309      0.009
 0.300-0.309     0.010      0.310-0.320      0.010
 0.310-0.318     0.011      0.321-0.330      0.011
 0.319-0.327     0.012      0.331-0.339      0.012
 0.328-0.335     0.013      0.340-0.348      0.013
 0.336-0.343     0.014      0.349-0.357      0.014
 0.344-0.350     0.015      0.358-0.365      0.015
 0.351-0.357     0.016      0.366-0.373      0.016
 0.358-0.364     0.017      0.374-0.381      0.017
 0.365-0.371     0.018      0.382-0.389      0.018
 0.372-0.377     0.019      0.390-0.396      0.019
 0.378-0.383     0.020      0.397-0.403      0.020
 0.384-0.388     0.021      0.404-0.409      0.021
 0.389-0.393     0.022      0.410-0.416      0.022
 0.394-0.399     0.023      0.417-0.422      0.023
 0.400-0.404     0.024      0.423-0.428      0.024
 0.405-0.409     0.025      0.429-0.434      0.025
 0.410-0.414     0.026      0.435-0.440      0.026
 0.415-0.419     0.027      0.441-0.446      0.027
 0.420-0.423     0.028      0.447-0.452      0.028
 0.424-0.428     0.029      0.453-0.457      0.029
 0.429-0.432     0.030      0.458-0.463      0.030
 0.433-0.436     0.031      0.464-0.468      0.031
 0.437-0.441     0.032      0.469-0.473      0.032
 0.442-0.445     0.033      0.474-0.478      0.033
 0.446-0.449     0.034      0.479-0.483      0.034
 0.450-0.453     0.035      0.484-0.488      0.035
 0.454-0.456     0.036      0.489-0.493      0.036
 0.457-0.460     0.037      0.494-0.498      0.037
 0.461-0.464     0.038      0.499-0.502      0.038
 0.465-0.467     0.039      0.503-0.507      0.039
 0.468-0.471     0.040      0.508-0.512      0.040
 0.472-0.474     0.041      0.513-0.516      0.041
 0.475-0.478     0.042      0.517-0.520      0.042
 0.479-0.481     0.043      0.521-0.525      0.043
 0.482-0.484     0.044      0.526-0.529      0.044
 0.485-0.488     0.045      0.530-0.533      0.045
 0.489-0.491     0.046      0.534-0.537      0.046
 0.492-0.494     0.047      0.538-0.542      0.047
 0.495-0.497     0.048      0.543-0.546      0.048
 0.498-0.500     0.049      0.547-0.550      0.049


To use this table, look up the value of r in the left-hand column and
add to it the corresponding value in the second column, as z is always
bigger than r. To turn z into r, look up z in the third column and
subtract the corresponding entry in the last column.


For a value of N greater than 3, the standard error of z is
$1/\sqrt{N - 3}$, and the standard error of the difference between
two z's is

$\sqrt{\dfrac{1}{N_1 - 3} + \dfrac{1}{N_2 - 3}}$,

where $N_1$ and $N_2$ are the numbers of pairs in the two samples.
As usual, if the difference between the two z's is greater than
twice its standard error, then the difference is significant.



Example 22. Two groups of children, one of average age 11
and the other of average age 14, are given an intelligence test
and an arithmetic test, and the scores are correlated for each
group separately. The numbers in the groups and the correlation
coefficients were as follows:

11 year olds, $N_1$ = 43, $r_1$ = 0.48,

14 year olds, $N_2$ = 39, $r_2$ = 0.39.

Can the correlation between intelligence and arithmetic be
regarded as different in the two groups?

From Table VI,

$z_1$ = 0.523 and $z_2$ = 0.412.

The difference is 0.111.

S.e. of difference = $\sqrt{\dfrac{1}{40} + \dfrac{1}{36}}$ = 0.230.

Hence the difference between the z's is just less than half its
standard error and so cannot be regarded as significant.
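Where a table of z is not to hand, the transformation can be computed directly, since Fisher's z is the inverse hyperbolic tangent of r. The following Python sketch repeats the working of Example 22 on that basis; the function names are illustrative, and the rule of twice the standard error is the one stated in the text.

```python
from math import atanh, sqrt

def fisher_z(r):
    """Fisher's z; atanh(r) equals 0.5 * (log(1 + r) - log(1 - r))."""
    return atanh(r)

def z_difference(r1, n1, r2, n2):
    """Difference between two z's and the standard error of that difference."""
    diff = fisher_z(r1) - fisher_z(r2)
    se = sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
    return diff, se

diff, se = z_difference(0.48, 43, 0.39, 39)
significant = abs(diff) > 2 * se
print(f"difference = {diff:.3f}, s.e. = {se:.3f}, significant: {significant}")
```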


6.x. Mean of several values of r. Use may also be made 
of the z transformation to obtain the average of several values 
of r. Coefficients of correlation should not be regarded as 
ordinary numbers which may be added and divided, and due 
weight should be given to the number of pairs in each corre¬ 
lation, values calculated from larger samples being more 
important than values from smaller samples. 

To average several values of r, first transform each r into z
and then multiply each z by N - 3, where N is the number of
pairs in the original r. Sum these products and divide the total
by the sum of the (N - 3)'s, giving the mean value of z. Finally
transform this mean z back into r, and this will be the correct
value for the mean of the original correlation coefficients. The
calculation may be conveniently tabulated as in the following
example.

Example 23. The same two variables are correlated in three
different groups. The numbers in the groups and the values of
r are given below. What is the average correlation in the three
groups?

             N      r
Group 1     23    0.41
Group 2     28    0.35
Group 3     35    0.50

Tabulate the work as follows:

   r        z      N - 3    (N - 3)z
 0.41     0.436     20        8.720
 0.35     0.365     25        9.125
 0.50     0.549     32       17.568
                    77       35.413

Mean z = 35.413/77 = 0.460.

For z = 0.460, r = 0.430 (from Table VI). Hence the
average correlation in the three groups is 0.43.

The significance of an average r may be tested as though it
had been calculated from S(N - 3) + 3 pairs of observations.
In the above example the average r may be tested for signi¬ 
ficance as though it had been a single r calculated from 80 pairs. 
It is therefore definitely significant, although of its component 
r’s only that in group 3 is significant by itself. 
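The weighted averaging of z's lends itself to a short routine. The following Python sketch, with an illustrative function name, weights each z by N - 3, converts the mean z back to r, and also returns the number of pairs from which the average may be treated as having been calculated.

```python
from math import atanh, tanh

def mean_correlation(groups):
    """groups is a list of (N, r) pairs; returns (mean z, mean r, equivalent N)."""
    weights = [n - 3 for n, _ in groups]
    weighted_sum = sum(w * atanh(r) for w, (_, r) in zip(weights, groups))
    mean_z = weighted_sum / sum(weights)
    return mean_z, tanh(mean_z), sum(weights) + 3

mean_z, mean_r, equivalent_n = mean_correlation([(23, 0.41), (28, 0.35), (35, 0.50)])
print(f"mean z = {mean_z:.3f}, mean r = {mean_r:.3f}, equivalent N = {equivalent_n}")
```

For contrast, the plain arithmetic mean of the three coefficients is 0.42, which gives no extra weight to the larger groups.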


6.xi. Partial correlation. In some cases it seems probable
that two variables X and Y are correlated partly on
account of the fact that each of them is correlated with a third
variable, Z. For instance, it may be that there is a correlation
between scores in an arithmetic examination and scores in a
Latin examination, partly because ability to do both arithmetic
and Latin is correlated with intelligence. In such a case
we may wish to find the correlation between X and Y quite
apart from the influence of Z. This may readily be done by the
method of partial correlation. All we need to know is the three
correlations, viz. $r_{XY}$ between X and Y, $r_{XZ}$ between X and Z
and $r_{YZ}$ between Y and Z. The correlation between X and Y
with the influence of Z removed is then given by the formula

$r_{XY.Z} = \dfrac{r_{XY} - r_{XZ}\,r_{YZ}}{\sqrt{1 - r_{XZ}^2}\,\sqrt{1 - r_{YZ}^2}}$.    (19)

The symbol $r_{XY.Z}$ is read 'the correlation between X and Y
keeping Z constant'.

A table giving the values of $\sqrt{1 - r^2}$ for all values of r to 3
places of decimals may be found in Tables for Statisticians and
Biometricians, Table VIII, p. 20 (Ref. 3).

Example 24. Three tests, A, B and C, were given to a group
of students and the three sets of scores were correlated with
each other, giving the following coefficients:

$r_{AB}$ = 0.66; $r_{AC}$ = 0.60; $r_{BC}$ = 0.40.

What is the correlation between A and B keeping C constant?
From formula (19),

$r_{AB.C} = \dfrac{0.66 - 0.60 \times 0.40}{\sqrt{1 - 0.36}\,\sqrt{1 - 0.16}} = \dfrac{0.42}{0.733} = 0.57$.

The above formula and method may be extended to four or
more variables. The formula for four variables is given below:
the student is unlikely to need further extensions.

$r_{AB.CD} = \dfrac{r_{AB.C} - r_{AD.C}\,r_{BD.C}}{\sqrt{1 - r_{AD.C}^2}\,\sqrt{1 - r_{BD.C}^2}}$.    (19A)

It will be seen that this partial correlation, the correlation
between A and B keeping both C and D constant, requires the
calculation of three other partial correlations, each keeping
only one variable constant.
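The way in which the four-variable formula is built up from three first-order partials can be seen in a short Python sketch. The coefficients for tests A, B and C are those of Example 24; the three correlations involving a fourth test D are purely hypothetical values supplied for illustration, as are the function names.

```python
from math import sqrt

def partial_r(r_xy, r_xz, r_yz):
    """Correlation between X and Y keeping Z constant."""
    return (r_xy - r_xz * r_yz) / (sqrt(1 - r_xz ** 2) * sqrt(1 - r_yz ** 2))

def partial_r_two(r, a, b, c, d):
    """Correlation between a and b keeping c and d constant.

    r maps frozenset pairs of test names to their ordinary correlations.
    """
    def first_order(p, q, k):
        return partial_r(r[frozenset((p, q))], r[frozenset((p, k))], r[frozenset((q, k))])
    # Three first-order partials, each keeping c constant, feed the same formula.
    return partial_r(first_order(a, b, c), first_order(a, d, c), first_order(b, d, c))

r_ab_c = partial_r(0.66, 0.60, 0.40)

correlations = {
    frozenset("AB"): 0.66, frozenset("AC"): 0.60, frozenset("BC"): 0.40,
    # Hypothetical values for a fourth test D, not taken from the text.
    frozenset("AD"): 0.30, frozenset("BD"): 0.20, frozenset("CD"): 0.10,
}
r_ab_cd = partial_r_two(correlations, "A", "B", "C", "D")
r_ab_dc = partial_r_two(correlations, "A", "B", "D", "C")
print(round(r_ab_c, 2), round(r_ab_cd, 3), round(r_ab_dc, 3))
```

The last two figures agree: it makes no difference which of the two variables is removed first.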


6.xii. Significance of a partial correlation coefficient.

The significance of a partial correlation coefficient may be
determined by calculating t, which in this case is given by

$t = \dfrac{r_p\sqrt{N - p - 2}}{\sqrt{1 - r_p^2}}$,

where $r_p$ is the partial coefficient and p is the number of
variables held constant. The number of degrees of freedom for
consulting the table of t (Table III) is n = N - p - 2. Thus in
Example 24, if there were 39 students in the group, we have
$r_p$ = 0.57, p = 1, and N = 39.

$t = \dfrac{0.57 \times 6.0}{\sqrt{0.6751}} = 4.16$.

Reference to Table III with n = 39 - 1 - 2 = 36 shows the
coefficient to be clearly significant.

EXERCISES ON CHAPTER VI 

20. Calculate the coefficient of correlation by the product-moment
method between:

(a) subjects 1-25 in tests C and D,

(b) subjects 26-50 in tests C and D,

(c) subjects 51-75 in tests C and D,

(d) subjects 76-100 in tests C and D.

Use formula (16A).

21. Check the results of Exercise 20 by using the method of
Section 6.v and formula (16B). In each case take 26 as arbitrary
origin for test C and 30 as arbitrary origin for test D.

22. Calculate the coefficient of correlation between each test from
A to G inclusive and each other one for the whole 100 subjects.
Employ the method of the correlation table, Section 6.vi, using formula
(16C) and the grouping adopted in Exercise 3. Repeat using the
method of diagonal summation and formula (16D).

23. Using formula (17), calculate the standard error of the values
of r obtained in Exercise 20 for the correlations between test F and
the other tests from A to G. Hence determine the significance of these
values of r and check by reference to (a) Table V and (b) the graph in
Appendix B.

24. Using the correlation coefficients given in the answer to 
Exercise 20, determine whether is significantly different from 

method of Section 6.viii and 

Table VI. 

25. From the correlation coefficients given in the answer to
Exercise 20, calculate the following partial correlation coefficients:

(a) between D and F, keeping C constant;

(b) between C and F, keeping D constant;

(c) between F and G, keeping E constant;

(d) between E and F, keeping G constant;

(e) between C and F, keeping B constant.



Chapter VII 

OTHER METHODS OF CORRELATION 


7.i. The ranking method. (a) Spearman's coefficient.

It sometimes happens that the actual values of two variables
cannot be accurately measured, although we are able to rank
them in order of size or merit. In such a case the method
of product-moment correlation cannot be applied, but an
approximate coefficient of correlation may be calculated. If
N pairs of variables are ranked, X and Y being ranked separately
of course, and d represents the difference between the
ranks of X and Y for any one pair, then the coefficient of ranked
correlation is given by the formula

$\rho = 1 - \dfrac{6S(d^2)}{N(N^2 - 1)}$.    (21)

This formula may be used whether the distributions of X and
Y are normal or not, but it only gives an approximate indication
of the association between the two variables and should
never be used if it is possible to calculate r. The coefficient
should not be employed in partial correlation, multiple correlation,
factorial analysis or any other statistical process which
is based on product-moment correlation.

The method of ranked correlation is frequently employed
because the calculation involved is simple for small samples.
The method of calculation is given here for use in cases where
the data are inadequate for product-moment correlation.

The first step is to rank each variable, calling the best or
biggest value 1, the second best or biggest 2, and so on. When
two or more values of a variable are the same it is usual to give
each the average rank. For instance, if there are two equal
values for the 6th place, each is ranked as 6½, the next rank
being 8, since these two will have occupied the 6th and 7th
places between them. Similarly, if there are three equal values
for the 10th place, each will be ranked as 11, since they will
occupy the 10th, 11th and 12th places between them, and the
next value will be ranked as 13. By this means, if there are N
pairs of variables, each variable will be ranked from 1 to N.
(This averaging of ranks introduces a further inaccuracy into
formula (21), for the formula assumes that each ranking is
different: it is, however, frequently impossible to avoid it in
using this method.)

The pairs of ranks are written down in two columns and a
third column, headed d, formed by subtracting entries in the
second column from corresponding entries in the first. If the
correct signs are put down in this column, the total of the
column, or S(d), should be zero. Each entry in the third
column is then squared, yielding a fourth column headed d².
Finally, this column is summed to obtain S(d²), which is then
substituted in formula (21).

In Appendix C are given values of the reciprocal of
6/N(N² - 1) for values of N from 10 to 60. The use of this, as
explained in the Appendix, will save the student a good deal of
arithmetic.

When N is moderately large, the significance of ρ may be
tested by calculating t, which is given in this case by

[formula (21A)]

In referring to Table III, n = N, the number of pairs of ranks.

Example 25. A class of 15 schoolboys was given an intelligence
test and the master provided their order of merit in an
entrance examination. What is the correlation between the
test and the entrance examination? Below are the marks and
order of merit of each boy.

Boy                        A    B    C    D    E    F    G    H
Order of merit             1    2    3    4    5    6    7    8
Intelligence test score   22   19    6   18   20   16   11    9

Boy                        I    J    K    L    M    N    O
Order of merit             9   10   11   12   13   14   15
Intelligence test score   15   12   10    7   13   12    8

Here one of the variables is ranked for us and may be written
straight down. As regards the marks for the test, boy A scores
most and is given the rank 1; E is next and is ranked 2; B is 3,
and so on, finally giving us the second column.
        Order of      Test
Boy    merit rank     rank       d        d²
 A         1            1        0        0
 B         2            3       -1        1
 C         3           15      -12      144
 D         4            4        0        0
 E         5            2        3        9
 F         6            5        1        1
 G         7           10       -3        9
 H         8           12       -4       16
 I         9            6        3        9
 J        10            8½       1½       2¼
 K        11           11        0        0
 L        12           14       -2        4
 M        13            7        6       36
 N        14            8½       5½      30¼
 O        15           13        2        4
                             +22, -22
                               = 0      265½

$\rho = 1 - \dfrac{6 \times 265.5}{15 \times 224} = 1 - 0.474 = 0.526$.

Examining the significance of this value of ρ by formula (21A),
we obtain a value of t which is just greater than the critical
value of 2.21 given in Table III for n = 15; hence we may
conclude that ρ is significant.
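The ranking with averaged ties and the substitution in formula (21) may be sketched in Python as follows. The scores are those of the fifteen boys as implied by the column of test ranks; the function names are illustrative, and the biggest score is ranked 1 as in the text.

```python
def average_ranks(values):
    """Rank values with the biggest as 1; tied values share the average rank."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1      # average of places pos + 1 to end + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks

def spearman_rho(rank_x, rank_y):
    """Ranked correlation: 1 - 6 S(d^2) / (N (N^2 - 1))."""
    n = len(rank_x)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * sum_d2 / (n * (n * n - 1))

merit_rank = list(range(1, 16))                    # boys A to O
scores = [22, 19, 6, 18, 20, 16, 11, 9, 15, 12, 10, 7, 13, 12, 8]
test_rank = average_ranks(scores)
sum_d = sum(a - b for a, b in zip(merit_rank, test_rank))
rho = spearman_rho(merit_rank, test_rank)
print(test_rank, sum_d, round(rho, 3))
```

The total of the d column comes to zero, which is the check on the signs recommended above.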

(b) Kendall's coefficient. Suppose we have n objects
ranked for each of two variables, X and Y. For convenience
these rankings may be written horizontally as below:

       A    B    C    D    E    F    G    H
X      1    2    3    4    5    6    7    8
Y      1    5   11    2    3½   8    6    3½

       I    J    K    L    M    N    O
X      9   10   11   12   13   14   15
Y     13    7   11    9   15   14   11



There are 15 objects ranked here for X and Y; and for simplicity,
though this is not necessary, the X rankings are
arranged in ascending order.

The essence of Kendall's method, which yields a coefficient
of ranked correlation called τ (tau), is to compare each pair of
ranks with each other pair and to allot to each a score of 1, 0
or -1. Let $X_i$ and $X_j$ be the ith and jth rankings for X, where
$X_j$ is to the right of $X_i$. If $X_i$ is smaller than $X_j$ a score of 1 is
allotted to that pair: if they are equal, the score is 0, and if
$X_i$ is the larger, the score is -1. Call the score for this pair $a_{ij}$.
Repeat the process for the Y rankings and call the score for
the $Y_i Y_j$ pair $b_{ij}$. With n objects there will be a total of
$\tfrac{1}{2}n(n - 1)$ pairs. For each pair, multiply $a_{ij}$ and $b_{ij}$ together
and sum the products over the whole $\tfrac{1}{2}n(n - 1)$ pairs to obtain
$S(a_{ij}b_{ij})$. Then τ, the Kendall coefficient of ranked correlation,
is given by the formula

$\tau = \dfrac{S(a_{ij}b_{ij})}{\tfrac{1}{2}n(n - 1)}$.    (22)

This is the value of the coefficient when there are no tied
rankings. When there are ties, as in the Y rankings above, the
denominator of formula (22) needs to be modified. There may
be several sets of tied ranks in one row, perhaps 2 ranks being
tied in one set, 3 in another, and so on. If t is the number of
ranks tied in any one set, calculate $\tfrac{1}{2}t(t - 1)$ for that set.
Repeat for all sets of ties in that row and add all the results.
Call the total $U_x$ for the X row and $U_y$ for the Y row. Then the
coefficient, when there are ties, is given by

$\tau = \dfrac{S(a_{ij}b_{ij})}{\sqrt{\{\tfrac{1}{2}n(n - 1) - U_x\}\{\tfrac{1}{2}n(n - 1) - U_y\}}}$.    (22A)


Example 26. Calculate τ for the data at the beginning of
this section.

First compare the ranking for A with that for B. In the X
row, 1 is less than 2 so that $a_{AB}$ = 1. In the Y row, 1 is less
than 5, so that $b_{AB}$ is also 1. Hence for the pair AB,

$a_{AB}b_{AB}$ = 1 × 1 = 1.

Next compare A with C, A with D and so on, until A has been
compared with each of the others B to O. Add the 14 values of
$a_{ij}b_{ij}$ thus obtained and record the total. We have now finished
with the A rankings. Next compare the B rankings with each
of the 13 others, C to O, in the same way. Some of these will
be found to be negative, e.g. for BD, $a_{BD}$ = 1 and $b_{BD}$ = -1, so
that $a_{BD}b_{BD}$ = -1. Also, for the pair EH, for example, $a_{EH}$ = 1
and $b_{EH}$ = 0, so that $a_{EH}b_{EH}$ = 0. Listing the various sub-totals
we find the following:

Compared            Sub-total
with remainder      $S(a_{ij}b_{ij})$
      A                14
      B                 7
      C                -4
      D                11
      E                 9
      F                 3
      G                 6
      H                 7
      I                -2
      J                 5
      K                 1
      L                 3
      M                -2
      N                -1

Add the various sub-totals which gives us $S(a_{ij}b_{ij})$, which in
this case is equal to 57.

There are no ties in the X row, so that $U_x$ = 0. In the Y row
there are two sets of ties. E and H are tied: for these t = 2 and
$\tfrac{1}{2}t(t - 1)$ = 1. C, K and O are also tied: for these t = 3 and
$\tfrac{1}{2}t(t - 1)$ = 3. Hence $U_y$ = 1 + 3 = 4. Substituting in formula
(22A) we find

$\tau = \dfrac{57}{\sqrt{105(105 - 4)}} = 0.554$.

(For the same data, as the student may verify, ρ = 0.717. It
will be found that as a rule the numerical value of τ is less than
that of ρ.)
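The pair-by-pair scoring is tedious by hand but simple to mechanise. The following Python sketch scores every pair, forms the tie corrections and applies the formula for tied rankings; the function names are illustrative, and the tied ranks 3½ are entered as 3.5.

```python
from math import sqrt

def sign(v):
    """Return 1, 0 or -1 according to the sign of v."""
    return (v > 0) - (v < 0)

def kendall_s(xs, ys):
    """Sum over all pairs i < j of the products of the two scores."""
    n = len(xs)
    return sum(sign(xs[j] - xs[i]) * sign(ys[j] - ys[i])
               for i in range(n) for j in range(i + 1, n))

def tie_term(values):
    """U: the sum of t(t - 1)/2 over each set of t tied values."""
    counts = {}
    for v in values:
        counts[v] = counts.get(v, 0) + 1
    return sum(t * (t - 1) // 2 for t in counts.values())

def kendall_tau(xs, ys):
    """Kendall's coefficient with the denominator corrected for ties."""
    n = len(xs)
    pairs = n * (n - 1) // 2
    return kendall_s(xs, ys) / sqrt((pairs - tie_term(xs)) * (pairs - tie_term(ys)))

x_rank = list(range(1, 16))                                    # objects A to O
y_rank = [1, 5, 11, 2, 3.5, 8, 6, 3.5, 13, 7, 11, 9, 15, 14, 11]

s = kendall_s(x_rank, y_rank)
u_x, u_y = tie_term(x_rank), tie_term(y_rank)
tau = kendall_tau(x_rank, y_rank)
print(s, u_x, u_y, round(tau, 4))
```

Since only the signs of the differences are used, the same functions accept raw scores in place of ranks.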

In the above example the X ranking was written out in
order. This is not essential, indeed actual ranking is not necessary;
for instance, below are the scores of 7 subjects in two
tests, X and Y:

       A    B    C    D    E    F    G
X     46   51   52   47   60   48   61
Y     13   15   12   10   16   13   18

Pairs of scores may be compared and values for $a_{ij}$ and $b_{ij}$
allotted as usual, dependent simply upon whether the second
score in each pair is larger, equal to or smaller than the first.
For the above data the student should find that τ = 0.586.
If now a further subject, H, were tested and obtained scores
of 57 in X and 17 in Y, τ for all 8 subjects could be calculated
by comparing the scores of subjects A to G with those for H
and adding the sub-total for $a_{ij}b_{ij}$ thus obtained to the $S(a_{ij}b_{ij})$
for the 7 subjects, and remembering that n in the denominator
of formula (22A) is now 8 instead of 7. The student may verify
that for all 8 subjects, τ = 0.618.

This illustrates a very useful property of Kendall's method
of calculating the coefficient of ranked correlation, viz. that
additional data may be added and τ calculated without having
to re-rank each time, as is necessary in Spearman's method.
If our data were in a time series, therefore, a fresh pair of
readings being made each week, say, we could calculate a
running value for τ and see how it varied with time.
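A running value of this kind may be kept with very little work, since each new pair of readings need only be compared with those already recorded. The following Python sketch shows one way of doing so; the class `RunningTau` and its methods are illustrative names, and the tie correction is recomputed from the stored readings each time the coefficient is asked for.

```python
from math import sqrt

def sign(v):
    """Return 1, 0 or -1 according to the sign of v."""
    return (v > 0) - (v < 0)

def tie_term(values):
    """U: the sum of t(t - 1)/2 over each set of t tied values."""
    counts = {}
    for v in values:
        counts[v] = counts.get(v, 0) + 1
    return sum(t * (t - 1) // 2 for t in counts.values())

class RunningTau:
    """Keeps the sum of pair scores up to date as readings arrive."""

    def __init__(self):
        self.xs, self.ys, self.s = [], [], 0

    def add(self, x, y):
        # The new reading is placed last and compared with every earlier one.
        self.s += sum(sign(x - xi) * sign(y - yi) for xi, yi in zip(self.xs, self.ys))
        self.xs.append(x)
        self.ys.append(y)

    def tau(self):
        # Needs at least two readings that are not all tied in either variable.
        n = len(self.xs)
        pairs = n * (n - 1) // 2
        return self.s / sqrt((pairs - tie_term(self.xs)) * (pairs - tie_term(self.ys)))

running = RunningTau()
for x, y in zip([46, 51, 52, 47, 60, 48, 61], [13, 15, 12, 10, 16, 13, 18]):
    running.add(x, y)
s_seven, tau_seven = running.s, running.tau()

running.add(57, 17)                      # the further subject H
s_eight, tau_eight = running.s, running.tau()
print(s_seven, round(tau_seven, 3), s_eight, round(tau_eight, 3))
```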

7.ii. Significance of τ. To test the significance of τ we
first need to calculate the variance of $S(a_{ij}b_{ij})$, which we will
call var S. When there are no tied rankings this is given by

var S = $\dfrac{n(n - 1)(2n + 5)}{18}$.*    (23)

Take the square root of this, which will give the standard
deviation of S, and then if the ratio of $S(a_{ij}b_{ij}) - 1$ to $\sqrt{\mathrm{var}\,S}$
is greater than 1.65, the value of τ is significant.

Example 27. Test the significance of a value of τ of 0.667.
There were 10 pairs of readings and $S(a_{ij}b_{ij})$ was 30.

From formula (23),

var S = 10 × 9 × 25/18 = 125,

$\sqrt{\mathrm{var}\,S}$ = 11.18,

$\dfrac{S(a_{ij}b_{ij}) - 1}{\sqrt{\mathrm{var}\,S}} = \dfrac{30 - 1}{11.18} = 2.59$.

This is greater than 1.65 so that τ is significant.

* See Appendix F for a table to help in the calculation of var S.
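The test for the case without ties may be set out in a few lines of Python. This is a minimal sketch with illustrative function names; it follows the rule as stated in the text, subtracting 1 from a positive sum of pair scores, and does not deal with tied rankings or with negative sums.

```python
from math import sqrt

def var_s_no_ties(n):
    """Variance of the sum of pair scores when there are no ties."""
    return n * (n - 1) * (2 * n + 5) / 18

def significance_ratio(s, n):
    """Ratio of (S - 1) to the standard deviation of S, for a positive S."""
    return (s - 1) / sqrt(var_s_no_ties(n))

n, s = 10, 30
tau = s / (n * (n - 1) / 2)            # the coefficient without ties
variance = var_s_no_ties(n)
ratio = significance_ratio(s, n)
print(round(tau, 3), variance, round(ratio, 2), ratio > 1.65)
```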

When there are tied rankings in either or both rows, the
expression for var S is more complicated. Suppose there are
ties to the extent of $t_1$, $t_2$, etc., in the X row; for each set
calculate t(t - 1)(2t + 5) and sum the sets to obtain
Σt(t - 1)(2t + 5).

Similarly for the Y row, calling the ties $u_1$, $u_2$, etc., we obtain
Σu(u - 1)(2u + 5). Other adjustments involving the values of
t and u have to be made, so that the complete expression for
var S when there are ties present is

$\mathrm{var}\,S = \tfrac{1}{18}\{n(n - 1)(2n + 5) - \Sigma t(t - 1)(2t + 5) - \Sigma u(u - 1)(2u + 5)\}$
$\qquad + \dfrac{\{\Sigma t(t - 1)(t - 2)\}\{\Sigma u(u - 1)(u - 2)\}}{9n(n - 1)(n - 2)} + \dfrac{\{\Sigma t(t - 1)\}\{\Sigma u(u - 1)\}}{2n(n - 1)}$.    (23A)

Using this the significance of τ may be tested as before.

7.iii. Partial rank correlation. Using the Kendall
coefficients of ranked correlation it is possible to calculate
partial correlation coefficients. It happens that the formula
is similar to formula (19) for partial product-moment correlation,
in fact

$\tau_{XY.Z} = \dfrac{\tau_{XY} - \tau_{XZ}\,\tau_{YZ}}{\sqrt{(1 - \tau_{XZ}^2)(1 - \tau_{YZ}^2)}}$.

As yet no tests for the significance of a partial τ are known.

No similar expression using Spearman coefficients is available.

7.1v. Biserial correlation. Observational data some¬ 
times make it impossible to calculate either product-moment 
or ranked correlation coefficients. For example, a test may 
applied to a group of subjects about whom the only other in; 

* See Appendices F and p for tables to help in the calculation. 



76 OTHER METHODS OF CORRELATION | [7.iV 

formation we possess is that each of them has either paired or 
failed a particular examination. In such a case, after making 
two assumptions, we may calculate a form of correlation 
coefficient known as the coefficient of biaerial correlation. This 
we shall denote bis, r. The two necessary assumptions are: 

(1) that the dichotomous variable is normally distributed,
and

(2) that the regression of X on Y is linear (see Chapter viii).

If both these assumptions are deemed justifiable and if we have
more than 80 observations, so that the data may be grouped,
the bis. r coefficient may be calculated as follows.

Suppose X is the numerical variable, i.e. the test, and Y the
dichotomous variable, i.e. the variable divided into only two
parts. Choose a convenient group unit for X and divide the
observations into the appropriate groups. Then construct a
two-row, or biserial, table showing how the subjects who pass
or fail in Y fall into the X-groups. Such a table might appear
as follows:

                               Test X
           -5  -4  -3  -2  -1   0   1   2   3   4   5   6   7   Total
Y  Pass     0   1   0   4   5   9   9  10   6   3   4   2   2     55
   Fail     1   0   2   6   7   9  11   4   2   2   0   1   0     45

   Total    1   1   2  10  12  18  20  14   8   5   4   3   2    100


As in calculating the standard deviation, we choose an
arbitrary origin for X and number off the groups on both sides, as
in Section 2.iii. This has been done in the table above.

Now let us call the portion of the x's which fall into the larger
part of the Y distribution, $x_1$. Then the mean of this row (which
will be the Passes in the above table) will be $\bar{x}_1$. The mean of all
the x's will be $\bar{x}$ and their standard deviation $\sigma_x$. All these
statistics may be calculated, in working units, from the table.
The corresponding statistics for Y cannot be calculated
directly, but assuming that we knew them, the coefficient of
biserial correlation would be given by the formula

$$\text{bis. } r = \frac{\bar{x}_1 - \bar{x}}{\sigma_x} \bigg/ \frac{\bar{y}_1 - \bar{y}}{\sigma_y}.$$

Although we cannot calculate $(\bar{y}_1 - \bar{y})/\sigma_y$ directly from the table,
we may obtain it indirectly (on the above assumptions) from
data given in Table II of the Tables for Statisticians and
Biometricians (Ref. 3). This table gives values of a quantity
$\tfrac{1}{2}(1+\alpha)$* and the values of a function z corresponding to them.
(This z is not to be confused with the z used by Fisher as a
transformation for r, as in Section 6.ix.) The quantity $\tfrac{1}{2}(1+\alpha)$
in our case is equal to $n_1/N$, where $n_1$ is the number of
observations in the larger portion of Y and N is the total number
of observations in the whole table. This is readily calculated and
the value of z corresponding to it may be found by interpolation.
(The method of doing this may be best understood from
Example 28.)

We then have

$$\frac{\bar{y}_1 - \bar{y}}{\sigma_y} = \frac{z}{n_1/N}.$$

The coefficient of biserial correlation is therefore obtained by
calculating $(\bar{x}_1 - \bar{x})/\sigma_x$ from the two-row table and dividing this by
the value of $(\bar{y}_1 - \bar{y})/\sigma_y$ obtained from the statistical tables.

The sign of bis. r has to be determined by inspection. If, for
instance, the mean x score of the Passes is greater than that of
the Failures, then the correlation is positive.

This method of correlation should not be resorted to if it is
possible to avoid it. In most cases it gives little more
information than would be acquired by investigating the significance
of the difference between the mean x-scores of the two Y-groups.
In any case, the ordinary standard error of r does not apply to
bis. r, and caution is needed in interpreting the coefficient. (The
standard error of bis. r is known only to a first approximation
and the student who wishes to look further into this matter is
referred to H. E. Soper, Biometrika, vol. x, p. 384, 1914.)

The method of calculation is shown in the example below. 

* If a vertical is drawn at any point x in a normal curve, the total
area is divided into two unequal portions, if x deviates from the mean.
$\tfrac{1}{2}(1+\alpha)$ is the area of the greater portion.

Example 28. Calculate the coefficient of biserial correlation
for the two-row table given above.

The Total row at the foot of the table gives the grouped
frequency distribution of x, so that $\bar{x}$ and $\sigma_x$ may be calculated
by the method of Section S.viii.


     f      x      fx     fx^2
     1     -5      -5      25
     1     -4      -4      16
     2     -3      -6      18
    10     -2     -20      40
    12     -1     -12      12
    18      0       0       0
    20      1      20      20
    14      2      28      56
     8      3      24      72
     5      4      20      80
     4      5      20     100
     3      6      18     108
     2      7      14      98
   ---            ---     ---
   100            144     645
                  -47
                  ---
                   97

$$\bar{x} = \frac{97}{100} = 0.97, \qquad \sigma_x = \sqrt{\frac{645}{100} - 0.97^2} = 2.347.$$

The mean $\bar{x}_1$ is obtained in a similar manner by multiplying
x by the entries in the top row; these we may call $f_1$.


    f_1     x     f_1 x
     0     -5       0
     1     -4      -4
     0     -3       0
     4     -2      -8
     5     -1      -5
     9      0       0
     9      1       9
    10      2      20
     6      3      18
     3      4      12
     4      5      20
     2      6      12
     2      7      14
   ---            ---
    55            105
                  -17
                  ---
                   88

$$\bar{x}_1 = \frac{S(f_1 x)}{n_1} = \frac{88}{55} = 1.6.$$



From the biserial table, therefore,

$$\frac{\bar{x}_1 - \bar{x}}{\sigma_x} = \frac{1.6 - 0.97}{2.347} = 0.2684.$$


Now we need to calculate $(\bar{y}_1 - \bar{y})/\sigma_y$ from the tables in Ref. 3.

$n_1 = 55$ and $N = 100$. Hence

$$\tfrac{1}{2}(1+\alpha) = 55/100 = 0.55.$$

From the table we find the following values of z
corresponding to two values of $\tfrac{1}{2}(1+\alpha)$:

for $\tfrac{1}{2}(1+\alpha) = 0.5477584$, z = 0.3960802,
for $\tfrac{1}{2}(1+\alpha) = 0.5517168$, z = 0.3955854.

Our value of $\tfrac{1}{2}(1+\alpha)$ lies between these two. Now the
difference between the two values of $\tfrac{1}{2}(1+\alpha)$ given above is
0.0039584. This corresponds to a difference in z of 0.0004948.

Our value of $\tfrac{1}{2}(1+\alpha)$ is 0.55 - 0.5477584 = 0.0022416 above
the smaller of the two values given above. It will be noticed
that as $\tfrac{1}{2}(1+\alpha)$ gets larger, z gets smaller: hence we have to
subtract from the first z that part of the difference between the
two z's which is proportional to 0.0022416/0.0039584, i.e. the
value of z corresponding to $\tfrac{1}{2}(1+\alpha) = 0.55$ is

0.3960802 - 0.0004948 x 0.0022416/0.0039584
= 0.3960802 - 0.0002802
= 0.3958000.*
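The interpolation just carried out may be sketched in Python as below. The function name is hypothetical; the four tabular values are those quoted above.

```python
def interpolate(x, x0, y0, x1, y1):
    """Linear interpolation: the value of y at x, given (x0, y0) and (x1, y1)."""
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)

# Tabular values of (1/2)(1 + alpha) and z on either side of 0.55.
z = interpolate(0.55, 0.5477584, 0.3960802, 0.5517168, 0.3955854)
print(round(z, 4))
```

Because z falls as the tabular argument rises, the correction term is negative, which is the subtraction made in the text.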


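Hence

$$\frac{\bar{y}_1 - \bar{y}}{\sigma_y} = \frac{z}{n_1/N} = \frac{0.3958}{0.55} = 0.7196.$$

Therefore

$$\text{bis. } r = \frac{(\bar{x}_1 - \bar{x})/\sigma_x}{(\bar{y}_1 - \bar{y})/\sigma_y} = \frac{0.2684}{0.7196} = 0.373.$$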


This coefficient must be positive since $\bar{x}_1$, the mean test score
of the Passes, is larger than the mean of all the subjects, and
so must be even larger than the mean of the Failures alone.
Hence the Passes have a better score than the Failures and the
relationship between the test scores and the examinational
success is a positive one.

* This method of linear interpolation is not strictly applicable to
the table but is sufficiently approximate for the present purpose.
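The whole of Example 28 may be sketched in Python as follows. Instead of interpolating in the printed table, this sketch obtains z from the standard library's `statistics.NormalDist`, on the assumption that z is the ordinate of the unit normal curve at the point cutting off an area of $\tfrac{1}{2}(1+\alpha)$; the function name is hypothetical.

```python
import math
from statistics import NormalDist

x_units = list(range(-5, 8))                          # working units -5 ... 7
passes = [0, 1, 0, 4, 5, 9, 9, 10, 6, 3, 4, 2, 2]     # larger part of Y
fails = [1, 0, 2, 6, 7, 9, 11, 4, 2, 2, 0, 1, 0]

def biserial_r(x_units, larger, smaller):
    """Biserial r from a two-row table; `larger` is the row with more cases."""
    totals = [a + b for a, b in zip(larger, smaller)]
    n_total = sum(totals)
    n_larger = sum(larger)
    mean_all = sum(f * x for f, x in zip(totals, x_units)) / n_total
    mean_larger = sum(f * x for f, x in zip(larger, x_units)) / n_larger
    variance = sum(f * x * x for f, x in zip(totals, x_units)) / n_total - mean_all ** 2
    sd_all = math.sqrt(variance)
    p = n_larger / n_total                            # this is (1/2)(1 + alpha)
    unit_normal = NormalDist()
    z = unit_normal.pdf(unit_normal.inv_cdf(p))       # ordinate at the cut
    return ((mean_larger - mean_all) / sd_all) / (z / p)

r_bis = biserial_r(x_units, passes, fails)
print(round(r_bis, 3))
```

The result is positive here because the row passed as `larger` has the higher mean; as the text says, the sign should still be settled by inspection.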


7.v. Fourfold correlation. When both variables are
dichotomous the methods previously described cannot be 
applied. There are various methods of calculating fourfold 
or tetrachoric correlation coefficients, but their use is not 
advised. The association between two dichotomous variables 
is best investigated by the method described in Section 9.iii, 
p. 99. 


EXERCISES ON CHAPTER VII 

26. (a) Using the Spearman ranking method (formula (21)), 
calculate the coefficient of ranked correlation between the order of 
merit H and tests G and D for the subjects 1-25 in Appendix E. 
(Rank the smallest order of merit as 1 and the largest test score as 1.) 

(b) From the rankings of the subjects obtained above, calculate
the Spearman coefficient of ranked correlation between tests G and 
D for subjects 1-25. Compare this with the value of r obtained in 
Exercise 20(a). 

27. Calculate the Kendall coefficient of ranked correlation for the 
data of Example 25, p. 70. 

28. Calculate the Kendall coefficients of ranked correlation for 
the same data as in Exercise 26 above. 



Chapter VIII 

REGRESSION AND THE CORRELATION RATIO 

8.i. Graphic construction of regression lines. When
we are considering the correlation between two variables, X
and Y, we may draw a graph showing the mean values of y
for regularly increasing values of x. This graph will be an
irregular line which is called the observed regression line of y
on x. Similarly the observed regression line of x on y is an
irregular line showing the mean values of x for regularly
increasing values of y. These lines indicate the law of change in
the mean of one variable for unit change in the other, and if
the lines are straight the regressions are said to be linear.

Usually, owing to errors of sampling, the observed regression 
lines are rather irregular, but it may be possible to ‘ fit ’ straight 
lines to them and to show mathematically that the observed 
regression lines do not depart significantly from the fitted 
straight regression lines. 

A straight line has an algebraic equation which represents it
in symbols; hence linear regression lines may be represented
by the following equations:

$$(y - \bar{y}) = r\frac{\sigma_y}{\sigma_x}(x - \bar{x}), \tag{25}$$

$$(x - \bar{x}) = r\frac{\sigma_x}{\sigma_y}(y - \bar{y}). \tag{25A}$$

The former is the regression straight line of y on x and the latter
the regression straight line of x on y. In these equations r is the
coefficient of correlation between X and Y. In (25) x is called
the independent and y the dependent variable, and vice versa
in (25A).

The angles these lines make to the horizontal and vertical
respectively are measured by the expressions

$$r\frac{\sigma_y}{\sigma_x} \quad\text{and}\quad r\frac{\sigma_x}{\sigma_y},$$

and these are called the coefficients of regression.

If r = 1, the two regression equations are identical and the
regression straight lines coincide. If r = 0, the two lines are
horizontal and vertical respectively and cross at right angles.
The lines cross for any intermediate value of r, so that the
larger the value of r, the smaller is the acute angle between
them. The two lines always cross at the point $\bar{x}$, $\bar{y}$ on the
graph, i.e. the point indicating where the means of x and y lie.

In any fairly small sample the successive observed means of
one variable for different values of the other are unlikely to fall
exactly on a straight line, but if the regression does not depart
significantly from linearity it is possible to draw a straight line
which passes very nearly through the observed means. As an
example, the regression straight lines of the data in Example
19, p. 58, will be drawn and the discrepancies between the
observed regression lines and these may be observed.

At the right-hand side of the correlation table in that
example we find two columns headed $f_y$ and $T_{x.y}$ respectively. If
we divide the entries in the second of these by the
corresponding entries in the first, we obtain the mean values of x
corresponding to each y-group. In Example 19 these are as
follows:


     y     f_y    T_x.y    T_x.y / f_y
    -5      1      -6        -6.00
    -4      1      -3        -3.00
    -3      2      -6        -3.00
    -2     10     -10        -1.00
    -1     12      -8        -0.67
     0     20     -16        -0.80
     1     22     -10        -0.45
     2     12       3         0.25
     3      8       8         1.00
     4      5       6         1.20
     5      4      11         2.75
     6      3      10         3.33


The last column gives the observed line of regression of x on y.
In like manner the observed line of regression of y on x may be
obtained by calculating a $T_{y.x}$ column from the correlation
table and dividing the entries in it by the corresponding entries
in the $f_x$ column. This gives us for the observed regression of
y on x:


     x     f_x    T_y.x    T_y.x / f_x
    -6      1      -5        -5.00
    -5      0                  *
    -4      3      -6        -2.00
    -3      7      -3        -0.43
    -2      8       2         0.25
    -1     21       0         0.00
     0     30       8         0.27
     1     15      30         2.00
     2      9      29         3.22
     3      3      10         3.33
     4      2      10         5.00
     5      1       6         6.00

* Note that there are no observations in the x = -5 column. There
will therefore be no entry in the last column for this group. Care must
be taken not to record the entry as 0.00, as this would be taken as a
point on the observed regression line.
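The division of the totals by the frequencies, with the empty array left blank, may be sketched in Python as below. The figures are those of the table for the regression of y on x; the function name is hypothetical, and `None` is simply the marker chosen here for "no entry".

```python
x_values = list(range(-6, 6))                          # working units -6 ... 5
f_x = [1, 0, 3, 7, 8, 21, 30, 15, 9, 3, 2, 1]          # frequency in each array
t_yx = [-5, 0, -6, -3, 2, 0, 8, 30, 29, 10, 10, 6]     # sum of y in each array

def array_means(frequencies, totals):
    """Mean of each array; an empty array gives None, never 0.00."""
    return [t / f if f > 0 else None for f, t in zip(frequencies, totals)]

means = array_means(f_x, t_yx)
for x, m in zip(x_values, means):
    print(x, "no entry" if m is None else round(m, 2))
```

Only the arrays for which a mean is returned should be plotted as points on the observed regression line.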

We now proceed to plot these points on a graph. Since we
are working from arbitrary origins in this example, the two
axes will be at right angles, crossing at the point x = 0, y = 0.
The x line will be horizontal and we mark off along it equal
divisions corresponding to the x-groups. Those to the right of
the origin, or point where the axes cross, will be positive and
those to the left negative, and they are numbered accordingly.
Similarly in the case of y, divisions above the origin are positive
and those below negative.

In Fig. 4 the points on the observed regression lines are
plotted, those for the regression of x on y being marked by
crosses and those for the regression of y on x by circles. The
regression straight lines themselves, AA and BB, are also
drawn. It will be seen that the crosses mostly lie closely about
the line AA and the circles about the line BB. These two
lines cross at the point where x = -0.21 and y = 0.81, which
were the mean values of x and y in working units found in
Example 19. The acute angle between the lines indicates a
fairly high correlation between x and y; it was found in fact
that r = 0.666.

The lines AA and BB are drawn as follows. The regression
coefficients are first calculated from the data r = 0.666,
$\sigma_x = 1.818$ and $\sigma_y = 2.181$.

Hence $r\sigma_x/\sigma_y = 0.666 \times 1.818/2.181 = 0.5551$

and $r\sigma_y/\sigma_x = 0.666 \times 2.181/1.818 = 0.7990$.

Substituting these in the regression equations, (25) and (25A),

$$(y - \bar{y}) = 0.7990(x - \bar{x}),$$
and
$$(x - \bar{x}) = 0.5551(y - \bar{y}).$$

Now we know that $\bar{x} = -0.21$ and $\bar{y} = 0.81$.

Hence we have

$$y - 0.81 = 0.7990(x + 0.21)$$
and
$$x + 0.21 = 0.5551(y - 0.81),$$

whence
$$y = 0.7990x + 0.9778$$
and
$$x = 0.5551y - 0.6596.$$

By substituting different values of x and y in these equations
and calculating the corresponding values of y and x, the fitted
regression lines may be plotted exactly. Two points are enough
to give each line.

In the first equation, for example,

if x = -5, y = -3.0172
and if x = 5, y = 4.9728.

These two points fix the line BB.

Similarly the line AA is obtained from the second equation:

if y = -5, x = -3.4351
and if y = 5, x = 2.1159.
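The fitting of the two straight lines may be sketched in Python as follows, using the values of r, the standard deviations and the means quoted above. The function and variable names are hypothetical. Because the sketch does not round the regression coefficients to four places before use, its results differ from the printed ones in the fourth decimal place.

```python
r = 0.666
sigma_x, sigma_y = 1.818, 2.181
mean_x, mean_y = -0.21, 0.81       # in working units

b_yx = r * sigma_y / sigma_x       # coefficient of regression of y on x
b_xy = r * sigma_x / sigma_y       # coefficient of regression of x on y

def y_on_x(x):
    """Fitted value of y from equation (25)."""
    return mean_y + b_yx * (x - mean_x)

def x_on_y(y):
    """Fitted value of x from equation (25A)."""
    return mean_x + b_xy * (y - mean_y)

# Two points fix each line.
line_bb = [(x, round(y_on_x(x), 2)) for x in (-5, 5)]
line_aa = [(round(x_on_y(y), 2), y) for y in (-5, 5)]
print(round(b_yx, 4), round(b_xy, 4))
print(line_bb, line_aa)
```

Both fitted lines pass through the point of means, which gives a quick check on the arithmetic.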

8.ii. Testing the linearity of regressions. Since the 
method of product-moment correlation is usually appropriate 
only in those cases where the regressions of the two variables 
on each other are linear, it is important to be able to ascertain 
whether in fact the regressions are linear in any particular 
case. Often the graphic method exemplified above suffices, 
but in cases of doubt (and always more satisfactorily) the 
linearity of regression may be tested mathematically. The 
process involves what is called the ‘Analysis of Variance’. 
(See Chapter xi.) 

Essentially the method is as follows. The student will remember
that the variance of x is obtained from the sum of the
squares of the deviations of each individual x from the general 
mean, x. This sum of squares may be split into two portions: 

(1) the sum of the squares of the deviations of each x from 
the mean of its own array, summed for all arrays, and 

(2) the sum of the squares of the deviations of the means of 
the arrays from the general mean, x. In the latter each array 
mean must be weighted by the number of observations in it. 




This splitting of the variance may be shown symbolically.
Let $f_x$ be the number of observations in a y-array corresponding
to a particular value of x, and $\bar{y}_x$ be the mean of the y's in this
array. It may then be shown that

$$S(y - \bar{y})^2 = S(y - \bar{y}_x)^2 + S[f_x(\bar{y}_x - \bar{y})^2],$$

where S means a summation for all different values of x.

Now if there is a linear regression of y on x, there will be a
mean y for each array which may be calculated from the
regression equation. The quantity $S[f_x(\bar{y}_x - \bar{y})^2]$ may therefore
be subdivided into two portions, one of which is the sum of
squares due to linear regression and the other the weighted
sum of the squares of the deviations of the array means from
the means calculated on the assumption of linear regression.
If this latter portion of the variance is sufficiently large, it
signifies that the means of the arrays differ significantly from
linear regression, and the following section shows how the
significance of this departure from linearity may be tested
arithmetically.
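The reason why the total sum of squares splits so cleanly may be set out in a few lines. The double-suffix notation $y_{xi}$, for the i-th observation in the array belonging to a given x, is introduced here only for this sketch and is not used elsewhere in the text.

```latex
\begin{aligned}
\sum_x \sum_i (y_{xi} - \bar{y})^2
  &= \sum_x \sum_i \bigl[(y_{xi} - \bar{y}_x) + (\bar{y}_x - \bar{y})\bigr]^2 \\
  &= \sum_x \sum_i (y_{xi} - \bar{y}_x)^2
     + 2 \sum_x (\bar{y}_x - \bar{y}) \sum_i (y_{xi} - \bar{y}_x)
     + \sum_x f_x (\bar{y}_x - \bar{y})^2 .
\end{aligned}
```

The middle term vanishes, since within each array the deviations from the array's own mean sum to zero, i.e. $\sum_i (y_{xi} - \bar{y}_x) = 0$. What remains is the sum of squares within arrays and the weighted sum of squares between arrays.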

8.iii. In Example 19, ten columns were constructed on the
right of the correlation table. In order to examine the linearity
of the regression of y on x for the same data, we need to
construct three further columns. The first of these is the $T_{y.x}$
column, as shown in Section 8.i. The second is obtained by
dividing entries in this first column by corresponding entries
in the $f_x$ column: this is headed $T_{y.x}/f_x$. Entries in this second
column are then multiplied by corresponding entries in the
first to give a final column headed $(T_{y.x})^2/f_x$.

From the sums of certain of these thirteen columns we have
to calculate three quantities:

(1) $S[f_x(\bar{y}_x - \bar{y})^2]$,

(2) $S(y - \bar{y})^2$,

(3) $\dfrac{[S(x - \bar{x})(y - \bar{y})]^2}{S(x - \bar{x})^2}$.

These can be calculated as follows:

(1) $S[f_x(\bar{y}_x - \bar{y})^2] = \Sigma\left(\dfrac{T_{y.x}^2}{f_x}\right) - \dfrac{[\Sigma(f_y y)]^2}{N}$,

(2) $S(y - \bar{y})^2 = \Sigma(f_y y^2) - \dfrac{[\Sigma(f_y y)]^2}{N}$,

(3) $\dfrac{[S(x - \bar{x})(y - \bar{y})]^2}{S(x - \bar{x})^2} = \dfrac{\left[\Sigma(y T_{x.y}) - \dfrac{\Sigma(f_x x)\,\Sigma(f_y y)}{N}\right]^2}{\Sigma(f_x x^2) - \dfrac{[\Sigma(f_x x)]^2}{N}}$.


From these data we construct a table. In computing the
number of degrees of freedom in each part of the variance, a
is the number of y-arrays.

                                               Sum of squares                                    Degrees of   Mean
                                                                                                 freedom      square
A. Total sum of squares                        $S(y-\bar{y})^2$                                  N - 1
B. Total sum of squares within arrays          (A - C)                                           N - a        (A - C)/(N - a)
C. Total sum of squares between arrays         $S[f_x(\bar{y}_x-\bar{y})^2]$                     a - 1
D. Sum of squares due to deviations from
   linear regression                           (C - E)                                           a - 2        (C - E)/(a - 2)
E. Sum of squares due to linear regression     $[S(x-\bar{x})(y-\bar{y})]^2 / S(x-\bar{x})^2$    1

In this table entries B and D are obtained by subtraction as
indicated, B by subtracting C from A, and D by subtracting
E from C. The entries in B and D are then divided by their
respective degrees of freedom, giving the two entries in the
column headed 'Mean square'.

*We next find the logarithms of the entries in the mean
square column and multiply the difference between these
logarithms by 1.1513 (to convert logarithms to the base 10 to
Napierian logarithms and to divide by 2). This gives us a
quantity called z, which is tabulated in Table VI of Fisher's
book. (Unfortunately this is the third z we have had to use and
should not be confused with those of Sections 6.ix and 7.iv.)

Hence

$$z = 1.1513\left[\log_{10}\frac{C-E}{a-2} - \log_{10}\frac{A-C}{N-a}\right].$$

* See next section.

It finally remains to look up the value for z in Fisher's table
corresponding to the number of degrees of freedom we have.
There are two values of n required, $n_1$ and $n_2$. The $n_1$ of the
table is the number of degrees of freedom corresponding to
the larger mean square, i.e. whichever of the entries B or D is
the bigger, and $n_2$ is the number of degrees of freedom
corresponding to the smaller mean square.

If the calculated z is smaller than the z in Fisher's 5 % point
table, then the regression of y on x does not differ significantly
from linearity.

The whole of the necessary calculation is shown in Example
29. It must be remembered that this is an examination of
the regression of y on x only: the regression of x on y should
also be examined by an exactly similar method, interchanging
x and y in each formula.

Example 29. Examine the linearity of the regression of y on 
X for the correlation table given in Example 19. 

For the sake of space the correlation table itself is not re¬ 
produced here. The first ten columns to the right of the table 
are copied from Example 19, p. 58, and were obtained as 
explained in the section preceding that example. 

From the calculations on p. 89 we may construct the 
analysis table. The number of arrays, a, = 12. 


                                                  Sum of      Degrees of   Mean
                                                  squares     freedom      square
A. Total sum of squares                           475.39         99
B. Total sum of squares within arrays             227.3033       88        2.5830
C. Total sum of squares between arrays            248.0867       11
D. Sum of squares due to deviations
   from linear regression                          37.2467       10        3.7247
E. Sum of squares due to linear regression        210.84          1

$$z = 1.1513(\log 3.7247 - \log 2.5830) = 0.1830.$$



We must now consult Fisher's 5 % point table of z. We have
$n_1 = 10$ and $n_2 = 88$.

Now we see from the table that when $n_1 = 12$ and $n_2 = 120$,
the 5 % point is 0.3032, hence the value for $n_1 = 10$ and $n_2 = 88$
must be bigger than this. Accordingly our calculated value
of z is much smaller than the 5 % point and we conclude that
the regression of y on x does not depart significantly from
linearity.

[Calculation table for Example 29 (the thirteen columns described in Section 8.iii, with their sums): not legible in this copy.]
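The analysis of Example 29 may be retraced in Python from the two small tables of Section 8.i, which carry all the column sums that are needed. This is a sketch only: the variable names are hypothetical, and the number of arrays is taken as 12, as in the text, although one of the twelve arrays is empty. Because nothing is rounded on the way, D differs from the printed figure in the third decimal place.

```python
import math

y_vals = list(range(-5, 7))
f_y = [1, 1, 2, 10, 12, 20, 22, 12, 8, 5, 4, 3]
t_xy = [-6, -3, -6, -10, -8, -16, -10, 3, 8, 6, 11, 10]   # sum of x in each y-group
x_vals = list(range(-6, 6))
f_x = [1, 0, 3, 7, 8, 21, 30, 15, 9, 3, 2, 1]
t_yx = [-5, 0, -6, -3, 2, 0, 8, 30, 29, 10, 10, 6]        # sum of y in each x-group

n = sum(f_y)
sum_y = sum(f * y for f, y in zip(f_y, y_vals))
sum_x = sum(f * x for f, x in zip(f_x, x_vals))

# A: total sum of squares of y about its mean.
total_ss = sum(f * y * y for f, y in zip(f_y, y_vals)) - sum_y ** 2 / n
# C: sum of squares between arrays (the empty array is skipped).
between_ss = sum(t * t / f for f, t in zip(f_x, t_yx) if f > 0) - sum_y ** 2 / n
# E: sum of squares due to linear regression.
cross = sum(y * t for y, t in zip(y_vals, t_xy)) - sum_x * sum_y / n
ss_x = sum(f * x * x for f, x in zip(f_x, x_vals)) - sum_x ** 2 / n
linear_ss = cross ** 2 / ss_x

within_ss = total_ss - between_ss        # B = A - C
deviation_ss = between_ss - linear_ss    # D = C - E

arrays = 12                              # value of a used in the text
ms_within = within_ss / (n - arrays)
ms_deviation = deviation_ss / (arrays - 2)
z = 1.1513 * (math.log10(ms_deviation) - math.log10(ms_within))

print(round(total_ss, 2), round(between_ss, 4), round(linear_ss, 2))
print(round(ms_within, 4), round(ms_deviation, 4), round(z, 3))
```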

8.iv. The method of calculating z at the end of the previous
section was included for the sake of theoretical completeness.
In actual practice the rather cumbrous calculation may be
avoided by the use of a table given in Fisher and Yates's tables
(Ref. 6). Instead of working out the value of z we may calculate
the variance ratio. In testing the linearity of regression by this
method, what we wish to show is whether or not the variance
due to deviations from linearity of regression is significantly
greater than that within arrays. If the former variance is
smaller than the latter, there is no need to make a test at all:
there is no evidence of departure from linearity of regression.
(The student need not concern himself with the very rare case
where the variance due to deviations from linear regression is
significantly smaller than the variance within arrays.)

The variance ratio in this case is the entry in the 'Mean
square' column in the D row divided by that in the B row. If
this ratio is greater than 1 we must consult Table V for the
5 % point in the Fisher and Yates tables. For this purpose $n_1$
in the table is the number of degrees of freedom corresponding
to the larger variance, i.e. that in the D row. If the calculated
ratio is smaller than the appropriate entry in the table, then
the regression does not depart significantly from linearity.

Taking the data of Example 29, for instance, the variance
ratio is 3.7247/2.5830 = 1.44. In this case $n_1 = 10$ and $n_2 = 88$.
In Table V of Fisher and Yates for $n_1 = 12$ and $n_2 = 120$ we
find an entry of 1.83 for the 5 % point. Our calculated variance
ratio is definitely smaller than this and so would be relatively
even smaller than the one corresponding to the number of
degrees of freedom in our data. (The correct 5 % point ratio
could be obtained from the table by interpolation, but there
is no need for this.) We conclude, therefore, that the regression
does not depart significantly from linearity, which agrees with
our findings in the previous section.
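The link between the variance ratio and the z of the previous section may be sketched in Python: z is half the Napierian logarithm of the ratio, and the multiplier 1.1513 is half the Napierian logarithm of 10. The function names are hypothetical.

```python
import math

def z_from_ratio(variance_ratio):
    """z is half the natural (Napierian) logarithm of the variance ratio."""
    return 0.5 * math.log(variance_ratio)

def ratio_from_z(z):
    """The variance ratio corresponding to a given z."""
    return math.exp(2 * z)

multiplier = math.log(10) / 2            # the 1.1513 of Section 8.iii
ratio = 3.7247 / 2.5830                  # D mean square over B mean square
z = z_from_ratio(ratio)
tabled_ratio = ratio_from_z(0.3032)      # the 5 % point of z quoted for 12 and 120

print(round(multiplier, 4), round(ratio, 2), round(z, 3), round(tabled_ratio, 2))
```

The last figure shows that the 5 % point of z quoted in Example 29 and the entry of 1.83 in the variance ratio table are the same criterion in two forms.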



8.v. The correlation ratio. If the regressions of the two
variables on each other are non-linear, the degree of
association between the variables cannot be measured by ordinary
correlation. There might be a real relationship and yet the
coefficient of correlation would be zero if the regressions were
semicircular and symmetrical, for example. In such cases a
measure of association may be obtained by calculating the
correlation ratio.

There are two correlation ratios for each pair of variables
and they are denoted by $\eta_{yx}$ and $\eta_{xy}$. The formulae for these
are simple:

$$\eta_{yx}^2 = \frac{\sigma_{\bar{y}_x}^2}{\sigma_y^2} = \frac{S[f_x(\bar{y}_x - \bar{y})^2]}{S(y - \bar{y})^2}, \tag{26}$$

$$\eta_{xy}^2 = \frac{\sigma_{\bar{x}_y}^2}{\sigma_x^2} = \frac{S[f_y(\bar{x}_y - \bar{x})^2]}{S(x - \bar{x})^2}. \tag{26A}$$

In these formulae $\sigma_{\bar{x}_y}$ signifies the standard deviation of the
means of the x-arrays and $\sigma_{\bar{y}_x}$ the standard deviation of the
means of the y-arrays, so that $\eta$ may be seen to be the ratio
between the standard deviation of the means of arrays and the
standard deviation of the whole sample.

It will be seen that the requisite data for the calculation of
$\eta_{yx}$ have already been obtained in Example 29, and if the
student has examined the linearity of the regression of x on
y he will also have the data for calculating $\eta_{xy}$.


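Example 30. Calculate $\eta_{yx}$ from the data obtained in
Example 29.

From that example we see that

$$S[f_x(\bar{y}_x - \bar{y})^2] = 248.0867 \quad\text{and}\quad S(y - \bar{y})^2 = 475.39.$$

Hence

$$\eta_{yx}^2 = \frac{248.0867}{475.39} = 0.52186,$$

whence

$$\eta_{yx} = \sqrt{0.52186} = 0.7224.$$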


8.vi. The data necessary for the examination of the
linearity of regression may now be expressed in an alternative
form. We have

$$S(y - \bar{y})^2 = N\sigma_y^2,$$

$$S[f_x(\bar{y}_x - \bar{y})^2] = N\sigma_y^2\eta_{yx}^2,$$

$$\frac{[S(x - \bar{x})(y - \bar{y})]^2}{S(x - \bar{x})^2} = N\sigma_y^2 r^2.$$

Also the z of Section 8.iii may be expressed as

$$z = 1.1513 \log \frac{(N-a)(\eta^2 - r^2)}{(a-2)(1-\eta^2)}.$$

If we substitute in this expression the values we have obtained,
we find that

$$z = 1.1513 \log \frac{88(0.7224^2 - 0.666^2)}{10(1 - 0.7224^2)} = 1.1513 \log \frac{6.89076}{4.7814} = 1.1513 \log(1.4412) = 0.183.$$

This result is identical with the value of z calculated in
Example 29.

It may also be observed that when $\eta$ and r are equal in the
above expression, the quantity $\eta^2 - r^2$ is zero, so that there is no
variance due to deviations from linear regression; i.e. when the
regressions are absolutely linear, both $\eta$'s are equal and equal to r.
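Examples 29 and 30 may be tied together in a short Python sketch, which computes the correlation ratio from the two sums of squares and then z from the correlation ratio and r. The function names are hypothetical; the figures are those of the text.

```python
import math

def correlation_ratio(between_ss, total_ss):
    """eta: square root of (sum of squares between arrays / total sum of squares)."""
    return math.sqrt(between_ss / total_ss)

def z_from_eta(eta, r, n, arrays):
    """z of Section 8.iii expressed through eta and r (logarithms to base 10)."""
    ratio = ((n - arrays) * (eta ** 2 - r ** 2)) / ((arrays - 2) * (1 - eta ** 2))
    return 1.1513 * math.log10(ratio)

eta_yx = correlation_ratio(248.0867, 475.39)
z = z_from_eta(eta_yx, 0.666, n=100, arrays=12)
print(round(eta_yx, 4), round(z, 3))
```

Note that the expression cannot be evaluated when the correlation ratio equals r, since the logarithm of zero is not defined; in that case there is simply nothing to test.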


EXERCISE ON CHAPTER VIII 

29. (a) From the table used in Exercise 22, examine the linearity
of the regressions of tests F and G on each other.

(b) Thence calculate the two correlation ratios for tests F
and G.



Chapter IX

$\chi^2$: CONTINGENCY: GOODNESS OF FIT

9.i. The statistical methods already described have mostly
been applicable to quantitative numerical data only, and
usually only to data which are at any rate approximately
normally distributed. It may happen, however, that the
available data are qualitative or quantitative only in the sense
that we know the number of cases falling into different
categories. In such instances the methods which have been
explained in the previous chapters cannot be used but there
are other methods which are appropriate.

These methods depend chiefly on a statistic known as $\chi^2$.
The mathematical derivation of this statistic is difficult and
cannot be described here. However, the distribution of it has
been worked out and tables are available showing the
frequency with which different values of $\chi^2$ are exceeded and
also the values of $\chi^2$ corresponding to particular frequencies.
Reference to the appropriate tables will be made later in
this chapter.

Use may be made of $\chi^2$ in the investigation of a number of
different problems, but the calculation of it is essentially the
same in each case. If O is the observed frequency in a particular
category into which a variable may fall and E is the frequency
which would be expected to fall in that category on some
hypothesis, then $\chi^2$ may be found by dividing the square of the
difference between O and E by E and summing these quotients
for all categories into which the variable falls. In symbols,

$$\chi^2 = S\left[\frac{(O-E)^2}{E}\right]. \tag{27}$$

This is the general formula and the calculation of it in
specific cases will be described in due course.
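Formula (27) may be sketched in Python as below. The function name is hypothetical and the frequencies are invented purely for illustration; they are not taken from the text.

```python
def chi_squared(observed, expected):
    """Sum of (O - E)^2 / E over all categories, as in formula (27)."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Hypothetical frequencies: 60 cases expected to fall equally into 3 categories.
observed = [18, 22, 20]
expected = [20, 20, 20]
chi2 = chi_squared(observed, expected)   # (4 + 4 + 0) / 20
print(chi2)
```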

Having found $\chi^2$ we need to know the number of degrees of
freedom available for calculating it in each particular case
before we can make use of the tables. Rules for finding the
number of degrees of freedom will be given in each instance.
Consultation of the tables will then indicate the probability,
P, of a calculated value of $\chi^2$ being exceeded as a result of
random sampling. If this probability is less than 0.05 (i.e.
19:1), then $\chi^2$ may be regarded as showing that the observed
data depart significantly from the hypothesis which is being
examined. This hypothesis may be that a variable has a
particular type of distribution or, more frequently, that there is
no association between two variables. This latter hypothesis,
known as a null hypothesis, assumes that two variables are not
associated: if we can disprove a null hypothesis, it follows that
the two variables must be associated. Such a procedure is
usual in the investigation of certain problems of association
which will now be described.

9.ii. Contingency. Suppose x and y are two variables
which are not measurable numerically but which can each be
divided into two or more categories. An example of this might
be hair colour and eye colour, for instance. Let the different
categories of x be $x_1$, $x_2$, $x_3$, etc., and those of y be $y_1$, $y_2$, $y_3$, etc.
We may then construct a contingency table showing the
number of $x_1$'s which fall into the $y_1$, $y_2$, $y_3$, etc. categories, and so on.
This will give us a sort of small correlation table. An example
is shown below for 5 x-categories and 4 y-categories.

            x_1     x_2     x_3     x_4     x_5     Total
   y_1                                              n_y1
   y_2                                              n_y2
   y_3                                              n_y3
   y_4                                              n_y4
   Total    n_x1    n_x2    n_x3    n_x4    n_x5    N


It will be seen that the table is divided into 5 x 4 = 20
rectangles or cells. The total frequency in each x-category is given
at the foot of the columns and will be $n_{x_1}$, $n_{x_2}$, etc. Similarly
the total frequency in each y-category is given at the end of the
horizontal rows and will be $n_{y_1}$, $n_{y_2}$, etc. The total frequency of
each variable, or pairs of variables, will as usual be N. In
general, if there are r rows and c columns, there will be r x c
cells. Each cell may be referred to by the x and y categories
into which it falls: e.g. the cell on the third row down and in the
second column from the left may be referred to as the $x_2$, $y_3$
cell.

The contingency table is completed by filling in the observed
frequency in each cell. For instance, we count how many
observations in the $x_1$ category are also in the $y_1$ category and
write down the number in the $x_1$, $y_1$ cell, and so on. From the
completed table we may calculate a coefficient known as the
coefficient of mean square contingency and denoted by C. To do
this we have first to calculate the frequency which would be
expected in each cell on the null hypothesis, i.e. on the
assumption that the two variables are not associated with one another.
This is done quite simply as follows: if $n_{x_c}$ is the total frequency
in the $x_c$ column, and $n_{y_r}$ the total frequency in the $y_r$ row, then
the frequency that would be expected in the $x_c$, $y_r$ cell on the
null hypothesis is

$$\frac{n_{x_c}\, n_{y_r}}{N}.$$

By substituting each of the values of c and r in turn, we obtain
the expected frequency in every cell.

The next step is to subtract the expected frequency from the
observed frequency in each cell: the resulting values of (O - E)
are written in each cell with due regard to sign, and the
arithmetic may be checked at this stage by observing that the total
of (O - E) is zero for each row and each column.

Next square (O - E) and divide the square by E for each cell.
This gives a series of quotients which is conveniently written
down on the right of the table: there will be r x c such quotients.
Finally this column of figures is added, giving $\chi^2$.
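The steps just described may be sketched in Python as follows. The function names are hypothetical and the 2 x 2 table of frequencies is invented for illustration only; it does not come from the text.

```python
def expected_frequencies(table):
    """Expected frequency in each cell: row total x column total / N."""
    row_totals = [sum(row) for row in table]
    column_totals = [sum(column) for column in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in column_totals] for r in row_totals]

def chi_squared_contingency(table):
    """Sum of (O - E)^2 / E over every cell of the contingency table."""
    expected = expected_frequencies(table)
    return sum((o - e) ** 2 / e
               for row_o, row_e in zip(table, expected)
               for o, e in zip(row_o, row_e))

# Hypothetical observed frequencies (rows are y-categories, columns x-categories).
observed = [[30, 10],
            [10, 30]]
expected = expected_frequencies(observed)
differences = [[o - e for o, e in zip(row_o, row_e)]
               for row_o, row_e in zip(observed, expected)]
chi2 = chi_squared_contingency(observed)

# Arithmetic check: the (O - E) values total zero in every row and column.
row_check = [sum(row) for row in differences]
column_check = [sum(column) for column in zip(*differences)]
print(expected, chi2, row_check, column_check)
```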

From this the coefficient of contingency, C, is given by the
formula

$$C = \sqrt{\frac{\chi^2}{N + \chi^2}}. \tag{28}$$

[Sometimes an intermediate step is inserted in the definition
of C. $\chi^2$ is called the 'square contingency' and if we divide it
by N we get the 'mean square contingency', which is denoted
by $\phi^2$. Then

$$C = \sqrt{\frac{\phi^2}{1 + \phi^2}}. \tag{28A}]$$


If the null hypothesis is correct, O and E will be equal for
each cell (apart from errors of sampling), so that $\chi^2$ will be zero
and C will also be zero. It is evident from formula (28) that C
can never quite equal unity. The actual maximum value of C
depends on the number of cells, so that for a 2 × 2 table, for
example, the maximum possible value of C is 0.707, and for a
10 × 10 table the maximum for C is 0.949. It is obvious, therefore,
that a value of C obtained from one contingency table
cannot be directly compared with that from another, unless
the number of rows and columns in the one is equal to that in
the other. Hence in reporting a value of C the number of cells
in the table should always be mentioned. Moreover, C is
always positive, so that the nature of the association has to be
determined by inspection of the table. It may be seen, therefore,
that C itself is not a very useful coefficient. Of much more
use is the information to be derived from $\chi^2$.

The standard error of C is exceedingly complex and can only
be interpreted for very large samples. Accordingly it is seldom
used and instead we find from the $\chi^2$ table the corresponding
value of P, which gives the probability of our calculated value
of $\chi^2$ being exceeded as the result of random sampling. There
are two usual methods of doing this, making use of a table
given by Fisher or one due to Elderton. Fisher’s method is
probably the more convenient and will be described first.

(1) First we need to know n, the number of degrees of freedom
available for calculating $\chi^2$. In the case of a contingency
table of r rows and c columns this is given by n = (r − 1)(c − 1).
We may then consult Fisher’s table, of which an extract is
given below, noting the value of $\chi^2$ appearing against this value
of n in the column headed P = 0.05. If the calculated value of
$\chi^2$ is greater than that given in the table, then it is significant,
and the null hypothesis is disproved, i.e. there is a significant
association between the variables.



An extract from Fisher’s table (Ref. 2, Table III) or Fisher
and Yates’s (Ref. 5, Table IV) for P = 0.05 is given below.

Table VII. Table of $\chi^2$ for different values of n
P = 0.05

   n    $\chi^2$        n    $\chi^2$        n    $\chi^2$
   1     3.841      11    19.675      21    32.671
   2     5.991      12    21.026      22    33.924
   3     7.815      13    22.362      23    35.172
   4     9.488      14    23.685      24    36.415
   5    11.070      15    24.996      25    37.652
   6    12.592      16    26.296      26    38.885
   7    14.067      17    27.587      27    40.113
   8    15.507      18    28.869      28    41.337
   9    16.919      19    30.144      29    42.557
  10    18.307      20    31.410      30    43.773

If n is greater than 30, significance can be tested by calculating
$\sqrt{2\chi^2} - \sqrt{2n - 1}$. If this exceeds 1.65, $\chi^2$ is significant.
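The rule for more than 30 degrees of freedom can be sketched in a few lines of code. This is only an illustration: the function names are hypothetical and the figures in the usage line ($\chi^2$ = 60 with n = 40) are invented for the purpose, not taken from any example in the text.

```python
import math


def large_n_statistic(chi2: float, n: int) -> float:
    """Return sqrt(2 * chi2) - sqrt(2n - 1) for n degrees of freedom."""
    return math.sqrt(2.0 * chi2) - math.sqrt(2.0 * n - 1.0)


def is_significant_large_n(chi2: float, n: int, critical: float = 1.65) -> bool:
    """Apply the rule quoted in the text: significant if the statistic exceeds 1.65."""
    return large_n_statistic(chi2, n) > critical


# Invented figures, used only to show the call.
print(round(large_n_statistic(60.0, 40), 3), is_significant_large_n(60.0, 40))
```

The rule is intended only for values of n beyond the range of Table VII; for n of 30 or less the tabulated values should be used.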

(2) To make use of Elderton’s table, which is given in
Tables for Statisticians and Biometricians (Ref. 3, Table XII),
we need to know n′, which is one more than the number of
degrees of freedom, i.e. n′ = (r − 1)(c − 1) + 1. Different values
of n′ are given at the heads of columns in Elderton’s table and
integral values of $\chi^2$ are listed at the side. By looking up the
value of P in the appropriate n′ column opposite the calculated
value of $\chi^2$, we obtain the probability of as great a value of $\chi^2$
or greater occurring as a result of random sampling; if this
value of P is less than 0.05, $\chi^2$ is significant. Since the listed
values of $\chi^2$ are integral, it is usually necessary to interpolate
to find the value of P.

The use of both tables is illustrated in Example 31. 

In using the method of contingency there are two provisions
to be borne in mind. First, E, the expected frequency, must
be at least 5 in each cell; secondly, the table should if possible
contain at least 6 rows and 6 columns. The former provision
is the more important.

Example 31. Two examiners assessed the intelligence of 200
students, one by a verbal test and the other by a performance
test. Each graded the intelligence as Very Good, Good, Fair
or Poor. The relation between the two sets of judgments is
shown in the contingency table below. Calculate the coefficient
of contingency and examine whether the relationship between
the two judgments can be regarded as significant or not.

                          Examiner 1
                    V.G.     G.      F.      P.
Examiner 2   V.G.    19      10       8       3
             G.       7      40       9       4
             F.       8      20      23      19
             P.       0       8      12      10

First we construct a contingency table, leaving room for
three entries in each cell. Then in each cell we enter the observed
frequency, the expected frequency and the difference
between the two, thus:

                           Examiner 1
Examiner 2            V.G.     G.      F.      P.     Total

  V.G.    O − E       12.2    −5.6    −2.4    −4.2       0
          O           19      10       8       3        40
          E            6.8    15.6    10.4     7.2      40

  G.      O − E       −3.2    16.6    −6.6    −6.8       0
          O            7      40       9       4        60
          E           10.2    23.4    15.6    10.8      60

  F.      O − E       −3.9    −7.3     4.8     6.4       0
          O            8      20      23      19        70
          E           11.9    27.3    18.2    12.6      70

  P.      O − E       −5.1    −3.7     4.2     4.6       0
          O            0       8      12      10        30
          E            5.1    11.7     7.8     5.4      30

  Total   O − E        0       0       0       0
          O           34      78      52      36       200
          E           34      78      52      36       200

Values of (O − E)²/E for each cell:

  V.G.    21.888     2.010     0.554     2.450
  G.       1.004    11.776     2.792     4.281
  F.       1.278     1.952     1.266     3.251
  P.       5.100     1.170     2.262     3.918

                                  $\chi^2$ = 66.952

$$C = \sqrt{\frac{\chi^2}{N + \chi^2}} = \sqrt{\frac{66.952}{266.952}} = 0.501.$$




Along the bottom and at the side of the table the arithmetic
is checked by showing that the total of the expected frequencies
for each row and each column is the same as the total
of the observed frequencies, and also that the sum of (O − E)
is zero for each row and each column. Then for each cell we
calculate (O − E)²/E and tabulate the values obtained on the
right of the table. The sum of these is $\chi^2$. The remainder of the
calculation of C is shown below the table.

We shall assess the significance of this result by both
methods.

(1) Using Fisher’s table:

n = (4 − 1)(4 − 1) = 9.

From Table VII, for n = 9, $\chi^2$ = 16.919. The calculated value
of $\chi^2$ is much bigger than the value in the table, hence the distribution
departs significantly from independence, i.e. there
is a significant relation between the two sets of judgments.

(2) Using Elderton’s table:

n′ = (4 − 1)(4 − 1) + 1 = 10.

For n′ = 10, opposite the calculated value of $\chi^2$, we get a value of P = 0.000000.
This means that a value of $\chi^2$ as big as or bigger than the one
calculated would arise as a result of random sampling less than
once in a million times. Accordingly there is no doubt about
the significance of the relationship between the two sets of
judgments.

Note that Fisher’s method is the easier for proving whether
or not there is a significant relationship in a single table. If
we wish to compare the significance in two or more tables we
need to know the value of P for each; in this case we have to
use Elderton’s table or else the complete $\chi^2$ tables given in
Fisher’s book.
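The arithmetic of Example 31 may be checked mechanically. The sketch below is one possible arrangement and not part of the original method: the function name `contingency_chi2` is hypothetical, and the expected frequency in each cell is taken as (row total × column total)/N as described in Section 9.ii.

```python
import math


def contingency_chi2(observed):
    """Return (chi2, C, degrees of freedom, expected table) for an r x c table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    # Expected frequency on the null hypothesis: row total * column total / N.
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    chi2 = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )
    c_coeff = math.sqrt(chi2 / (n + chi2))  # formula (28)
    dof = (len(observed) - 1) * (len(observed[0]) - 1)
    return chi2, c_coeff, dof, expected


# Observed frequencies of Example 31 (rows: Examiner 2, columns: Examiner 1).
example_31 = [
    [19, 10, 8, 3],
    [7, 40, 9, 4],
    [8, 20, 23, 19],
    [0, 8, 12, 10],
]
chi2, c_coeff, dof, expected = contingency_chi2(example_31)
print(f"chi2 = {chi2:.3f}, C = {c_coeff:.3f}, n = {dof}")
```

The value of $\chi^2$ so obtained is then compared with Table VII for the returned number of degrees of freedom, exactly as in the worked example.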

9.iii. 2 × 2 tables. A special form of contingency arises
when both variables are dichotomous, i.e. each variable can be
divided into only two classes. The association between such
variables may be shown by constructing a contingency table
with only four cells, since there will be only two columns and
two rows. We shall now examine the significance of a fourfold
or 2 × 2 table making use of the $\chi^2$ method.



In order to simplify the notation we shall denote the observed
frequencies in the four cells by a, b, c and d as under:

            $x_1$      $x_2$
  $y_1$       a        b       a + b
  $y_2$       c        d       c + d
            a + c    b + d      N

It will be seen that N = a + b + c + d.

The value of $\chi^2$ may be determined in the same way as that
used for any contingency table, but the arithmetic may be
shortened by using the formula

$$\chi^2 = \frac{(ad - bc)^2 N}{(a + c)(b + d)(c + d)(a + b)}. \qquad (29)$$

The denominator of this will be seen to be the product of the
totals of the rows and columns.

Having found $\chi^2$, its significance may be determined by consulting
Fisher’s table for n = 1. It will be seen that if the
calculated value of $\chi^2$ is as great as or greater than 3.841, then
there is a significant association between the two variables.

Alternatively, the value of P may be found by consulting a
special table given by Yule (Ref. 1, p. 534). Reference to this
table will show that for a value of $\chi^2$ equal to 3.84, P = 0.05.
This, of course, agrees with Fisher’s table, but if exact values
of P are required for purposes of comparison they may be
found from Yule’s table.

As in the case of all contingency tables, the nature of the 
association in a 2 x 2 table has to be determined by inspection. 

Example 32. Is there a significant association between X
and Y in the following data?

                     X
               $x_1$     $x_2$     Total
  Y   $y_1$      65       25        90
      $y_2$      20       50        70
      Total     85       75       160

From formula (29)

$$\chi^2 = \frac{(65 \times 50 - 25 \times 20)^2 \times 160}{85 \times 75 \times 70 \times 90}.$$

This may readily be evaluated by logarithms and we find that
$\chi^2$ = 30.13.

This value is much greater than 3.84, the critical significance
value of $\chi^2$ for one degree of freedom, so that we may conclude
that there is a significant relationship between X and Y.
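Formula (29) lends itself to direct evaluation without logarithms. The following is a minimal sketch, with a hypothetical function name, applied to the four cell frequencies of Example 32.

```python
def chi2_2x2(a: float, b: float, c: float, d: float) -> float:
    """Formula (29): chi-squared for a fourfold table with rows (a, b) and (c, d)."""
    n = a + b + c + d
    return (a * d - b * c) ** 2 * n / ((a + c) * (b + d) * (c + d) * (a + b))


# Cell frequencies of Example 32.
value = chi2_2x2(65, 25, 20, 50)
print(round(value, 2))

# With one degree of freedom the 5 per cent point of Table VII is 3.841.
print(value > 3.841)
```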

9.iv. Yates’s correction. The $\chi^2$ distribution is a continuous
one but in 2 × 2 tables the number of sets of observations
which can fit the marginal totals is finite and limited,
so that the actual distribution from such a table is definitely
discontinuous. Allowance for this fact may be made by
applying Yates’s correction for continuity. Essentially this
consists of decreasing by ½ those cell values which are greater
than expectation and increasing by ½ those which are less
than expectation. This has the effect of slightly decreasing
the value of $\chi^2$.

Applying this correction, formula (29) becomes

$$\chi^2 = \frac{\left(|ad - bc| - \tfrac{1}{2}N\right)^2 N}{(a + c)(b + d)(c + d)(a + b)}. \qquad (30)$$

From the data in the previous Example the student may
verify that the value of $\chi^2$ applying Yates’s correction is 28.4.
With such a large value the correction has not altered significance,
but with borderline cases, where $\chi^2$ is only just significant
as calculated by formula (29), the correction should
always be applied since it may very easily bring the value below
the significance level.
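The verification suggested above may be sketched as follows. The code assumes the usual algebraic form of the correction for a fourfold table, in which the numerator (ad − bc)² N is replaced by (|ad − bc| − N/2)² N; the function name is hypothetical.

```python
def chi2_2x2_yates(a: float, b: float, c: float, d: float) -> float:
    """Chi-squared for a fourfold table with Yates's correction for continuity."""
    n = a + b + c + d
    # Moving each cell half a unit towards expectation reduces |ad - bc| by N/2.
    numerator = (abs(a * d - b * c) - n / 2) ** 2 * n
    return numerator / ((a + c) * (b + d) * (c + d) * (a + b))


# Data of Example 32.
print(round(chi2_2x2_yates(65, 25, 20, 50), 2))
```

The corrected value is smaller than the uncorrected 30.13 but remains far above 3.841, so the conclusion for these data is unchanged.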

9.v. Alternative method of calculation. An alternative
method of calculating $\chi^2$ from a 2 × 2 table, which may be used
as an arithmetical check on results obtained from formula (29),
is as follows:






Let $p_1$ be the proportion of $y_1$ which is in the $x_1$ class;
$p_2$ be the proportion of $y_2$ which is in the $x_1$ class;
p be the proportion of the total population in the $x_1$
class; and
q = 1 − p.

Then

$$p_1 = \frac{a}{a + b}; \qquad p_2 = \frac{c}{c + d}; \qquad p = \frac{a + c}{N}.$$

$\chi^2$ is then given by the formula

$$\chi^2 = \frac{p_1 a + p_2 c - p(a + c)}{pq}. \qquad (31)$$

The algebraic proof of the identity of this expression with that
given in formula (29) is quite simple and is left to the student
as an exercise. Note that Yates’s correction cannot be applied
with this method of calculation.

Applying this formula to the data of Example 32 we have:

$p_1$ = 65/90, $p_2$ = 20/70, p = 85/160 and q = 75/160.

Hence

$$\chi^2 = \frac{\dfrac{65 \times 65}{90} + \dfrac{20 \times 20}{70} - \dfrac{85 \times 85}{160}}{\dfrac{85}{160} \times \dfrac{75}{160}} = \frac{46.944 + 5.714 - 45.156}{0.249} = 30.13.$$

This result can be seen to be identical with that obtained in
Example 32.
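The agreement between the two methods can also be confirmed numerically. The sketch below evaluates the expression {p₁a + p₂c − p(a + c)}/(pq) for the data of Example 32; the function name is hypothetical, and the check is arithmetical only (it does not replace the algebraic proof left as an exercise).

```python
def chi2_from_proportions(a: float, b: float, c: float, d: float) -> float:
    """Chi-squared for a fourfold table by the method of proportions."""
    n = a + b + c + d
    p1 = a / (a + b)   # proportion of the first row in the first column
    p2 = c / (c + d)   # proportion of the second row in the first column
    p = (a + c) / n    # proportion of the whole population in the first column
    q = 1 - p
    return (p1 * a + p2 * c - p * (a + c)) / (p * q)


print(round(chi2_from_proportions(65, 25, 20, 50), 2))
```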

(Note. For a way of treating 2 × 2 tables by the method of
rank correlation see J. W. Whitfield, Biometrika, vol. xxxiv,
December 1947, pp. 295-6.)


9.vi. 2 × n tables. Cases frequently occur where we wish
to examine the association between two variables, one of
which is dichotomous whilst the other is capable of division
into several, say n, classes. In such cases $\chi^2$ may be calculated
by the method of Section 9.ii, but the calculation of the
expected frequencies may be avoided in the following manner.
Suppose the 2 × n table is represented as under, y being the
dichotomous variable and x the variable with several categories
(four in the example). The observed frequencies in the cells
are denoted in each case by the letter a with appropriate
suffixes and dashes:





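             $x_1$            $x_2$            $x_3$            $x_4$
  $y_1$      $a_1$            $a_2$            $a_3$            $a_4$            $n_1$
  $y_2$      $a_1'$           $a_2'$           $a_3'$           $a_4'$           $n_2$
          $a_1 + a_1'$     $a_2 + a_2'$     $a_3 + a_3'$     $a_4 + a_4'$         N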


If we take a and a′ to represent any associated pair of
observed frequencies, then we calculate for each pair

$$\frac{1}{a + a'}\,(a n_2 - a' n_1)^2,$$

where $n_1$ and $n_2$ are the total frequencies in the $y_1$ and $y_2$ classes.
This expression is evaluated for each pair of associated
frequencies and the sum of all such expressions is divided by
($n_1$ × $n_2$), giving $\chi^2$. Hence

$$\chi^2 = \frac{1}{n_1 n_2}\, S\left\{\frac{(a n_2 - a' n_1)^2}{a + a'}\right\}. \qquad (32)$$

The number of degrees of freedom in a 2 × n table is n − 1, so
that the significance of $\chi^2$ in the above case may be determined
by consulting Table VII for n = 3.


Example 33. Determine the significance of the association
between the variables x and y in the following contingency
table:

            $x_1$    $x_2$    $x_3$    $x_4$    Total
  $y_1$      36       42       30       12      120
  $y_2$      12       22       20       26       80
  Total      48       64       50       38      200






Calculating the expression $\frac{1}{a + a'}(a n_2 - a' n_1)^2$ to the nearest
whole number for each pair of associated frequencies, we have

$$\tfrac{1}{48}(36 \times 80 - 12 \times 120)^2 = \frac{2{,}073{,}600}{48} = 43{,}200,$$

$$\tfrac{1}{64}(42 \times 80 - 22 \times 120)^2 = \frac{518{,}400}{64} = 8{,}100,$$

$$\tfrac{1}{50}(30 \times 80 - 20 \times 120)^2 = 0,$$

$$\tfrac{1}{38}(12 \times 80 - 26 \times 120)^2 = \frac{4{,}665{,}600}{38} = 122{,}779.$$

The sum of these quantities = 174,079. Therefore

$$\chi^2 = \frac{174{,}079}{120 \times 80} = 18.1.$$

In Table VII for P = 0.05 and n = 3 we find a value of $\chi^2$ of
7.815. Our calculated value is much greater than this and we
may therefore conclude that there is a significant association
between the two variables.

An alternative method of calculation, similar to that given
for 2 × 2 tables, may be used as a check. If p is the proportion
in the $y_1$ class in any vertical column, i.e. p = a/(a + a′), and
P is the proportion in the $y_1$ class in the whole population, so
that P = $n_1$/N and Q = 1 − P, then

$$\chi^2 = \frac{1}{PQ}\left\{S(ap) - n_1 P\right\} = \frac{N^2}{n_1 n_2}\left\{S\left(\frac{a^2}{a + a'}\right) - \frac{n_1^2}{N}\right\}. \qquad (33)$$

To calculate this, first work out a²/(a + a′) for each vertical
column and then substitute in the above formula. We obtain:

36²/48 = 27.0
42²/64 = 27.5625
30²/50 = 18.0
12²/38 =  3.7895

S[a²/(a + a′)] = 76.3520

By substitution

$$\chi^2 = \frac{200^2}{120 \times 80}\left(76.352 - \frac{120^2}{200}\right) = 4.1667 \times 4.352 = 18.13.$$





9.vii. Goodness of fit. One other use of the $\chi^2$ method
may be given. Suppose we have an observed frequency distribution
of a variable and wish to examine the validity of
some hypothesis about that distribution. We may do this by
calculating what the distribution would be on that hypothesis
and examining the agreement, or goodness of fit, of the observed
and calculated distributions.

As an example, an observed frequency distribution of accidents
will be given and an examination made of the hypothesis
that the accidents are distributed in a Poisson series. If the
probability of an event occurring is very small but a long
enough period of observation is taken for it to happen sometimes,
so that it may happen 0, 1, 2, 3, ... times, then the distribution
of its occurrence will form what is called a Poisson
series, if a large enough number of independent observations is
made. The frequency distribution for 0, 1, 2, 3, ... occurrences
will be given by the successive terms of the series

$$N e^{-m}\left(1,\; m,\; \frac{m^2}{2!},\; \frac{m^3}{3!},\; \ldots\right),$$

where N is the total number of observations and m is the mean
number of occurrences.


Example 34. Examine the hypothesis that the following
frequency distribution of accidents forms a Poisson series:

  Number of      Observed
  accidents      frequency
      0              14
      1              37
      2              76
      3              70
      4              64
      5              53
      6              31
      7              19
      8              14
      9               9
     10               5
     11               5
     12               1

    Total           398





In order to calculate the expected frequency on the assumption
that the distribution forms a Poisson series we need to
know m, the mean number of accidents. This is readily calculated
by the method of formula (2), Section 2.ii. We find that
S(fX) = 1549, whence the mean number of accidents is
1549/398 = 3.892. Substituting this value for m and 398 for
N in the above formula for the Poisson series, we obtain the
expected frequencies as below (see later for method of calculation):

  Number of      Expected
  accidents      frequency
      0             8.12
      1            31.61
      2            61.51
      3            79.80
      4            77.64
      5            60.43
      6            39.20
      7            21.80
      8            10.60
      9             4.59
     10             1.78
     11             0.63
     12             0.20
     13             0.06
     14             0.02

    Total         397.99


In order to satisfy the provision that in calculating $\chi^2$ the
expected frequencies shall not be less than 5, the last six
entries are grouped together. Since the expected frequencies
have been calculated to only 2 places of decimals, the total of
the expected frequencies is not exactly 398, but it is so very
little less that it can have no marked effect on the value of $\chi^2$.
Calling the observed and expected frequencies O and E respectively,
we now construct columns headed (O − E), (O − E)²
and (O − E)²/E. The total of the last column is $\chi^2$. All six
columns are given below.

We have used ten groups to calculate $\chi^2$ but the number of
degrees of freedom is only 8, since both N and m were fixed in
the formation of the Poisson series. For n = 8, we find that
$\chi^2$ = 15.507 in Table VII. The calculated value is much
greater than this, hence the Poisson series is a bad fit to the
observed accident distribution. We find, in fact, from Elderton’s
table that for such a value of $\chi^2$ P is less than 0.001,
and so we may conclude that the observed distribution
certainly does not form a Poisson series.


  No. of
  accidents       O        E       (O − E)    (O − E)²    (O − E)²/E
      0          14      8.12       5.88      34.5744       4.26
      1          37     31.61       5.39      29.0521       0.92
      2          76     61.51      14.49     209.9601       3.41
      3          70     79.80      −9.80      96.0400       1.20
      4          64     77.64     −13.64     186.0496       2.40
      5          53     60.43      −7.43      55.2049       0.91
      6          31     39.20      −8.20      67.2400       1.72
      7          19     21.80      −2.80       7.8400       0.36
      8          14     10.60       3.40      11.5600       1.09
  9 and over     20      7.28      12.72     161.7984      22.23

    Total                           0.01                   38.50

(Note. The total of the (O − E) column should as usual be
zero. It is not exactly so in our case since the expected frequencies
were calculated to only two places of decimals.)
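The whole of Example 34 can be retraced without logarithm tables. The sketch below makes two assumptions which should be noted: the observed frequencies for 10 and 11 accidents are taken as 5 and 5, which is what the stated totals N = 398 and S(fX) = 1549 require, and the expected frequency for the pooled class "9 and over" is taken as N less the sum of the earlier terms rather than as the sum of the terms for 9 to 14. The function name is hypothetical. Because nothing is rounded to two decimals on the way, the result agrees with the tabulated 38.50 only to about the first decimal place.

```python
import math


def poisson_expected(total: float, mean: float, max_count: int):
    """Expected frequencies N * e^(-m) * m^k / k! for k = 0 .. max_count."""
    term = total * math.exp(-mean)
    expected = [term]
    for k in range(1, max_count + 1):
        term *= mean / k  # each term is the previous one multiplied by m / k
        expected.append(term)
    return expected


# Observed frequencies for 0, 1, ..., 12 accidents (see the assumption above).
observed = [14, 37, 76, 70, 64, 53, 31, 19, 14, 9, 5, 5, 1]
total = sum(observed)
mean = sum(k * f for k, f in enumerate(observed)) / total

# Classes 0 to 8 separately, then one pooled class for 9 and over.
expected = poisson_expected(total, mean, 8)
expected.append(total - sum(expected))
pooled_observed = observed[:9] + [sum(observed[9:])]

chi2 = sum((o - e) ** 2 / e for o, e in zip(pooled_observed, expected))
dof = len(pooled_observed) - 2  # N and m were both fixed from the data

print(f"m = {mean:.3f}, chi2 = {chi2:.2f}, degrees of freedom = {dof}")
```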


Method of calculating the expected frequencies

The calculation of the expected frequencies in the above
example is best done by using logarithms. In the Poisson series
the term $N e^{-m}$ is common to each successive frequency, so this
is worked out first:

log 398 = 2.59988,
log e = 0.43429,
log log e = $\bar{1}$.63778,
log 3.892 = 0.59018;

hence 3.892 log e = antilog ($\bar{1}$.63778 + 0.59018)
= 1.69026;

therefore log 398 $e^{-3.892}$ = 2.59988 − 1.69026
= 0.90962,

whence 398 $e^{-3.892}$ = 8.12.



The values of the successive terms may then be found by a
tabular method, making use of the logarithms and logarithms
of factorials given in Fisher and Yates’s tables (Ref. 5, pp. 62,
76).

The method is indicated below; p is the index of the power
to which m is raised in the successive terms:

   p     $\log m^p$     $\log p!$     $\log m^p - \log p!$     $(\log m^p - \log p!) + \log N e^{-m}$     antilog
   1     0.59018      0.0          0.59018               1.49980                               31.61
   2     1.18036      0.30103      0.87933               1.78895                               61.51
   3     1.77054      0.77815      0.99239               1.90201                               79.80
  etc.   etc.         etc.         etc.                  etc.                                  etc.

9.viii. Curve fitting. Frequently in biological experiments
a series of observations or groups of observations is
obtained and it is wished to examine the trend of the series.
A graph of the series may indicate to the eye what shape the
trend appears to have, but it is often important to determine
mathematically what is the algebraic equation of the line that
best describes the trend. For example, a group of subjects is
repeatedly tested with the same test and we may wish to define
the shape of the improvement curve obtained, or we may wish
to find the shape of the growth curve of experimental animals
fed daily on a certain diet. Such a process is known as curve
fitting, and in many instances the process may be reduced
mathematically to a method similar to that of fitting a straight
regression line to the data. (See Chapter viii.)

In the simplest case the data appear to lie on a straight line,
and what is done in such a case is to calculate from the data
the equation of the best fitting linear regression line and then
check the goodness of fit by the method of analysis of variance.
Now the algebraic equation of a straight line is y = ax + b.
This is equivalent to the linear regression of y on x. The constant
a indicates the slope of the regression line and b the
height of the line above the base. Hence a is equivalent to the
regression coefficient of Section 8.i, i.e.

$$a = r\frac{\sigma_y}{\sigma_x} \quad \text{and} \quad b = \bar{y} - a\bar{x}.$$

If our data consist merely of a series of N means we have insufficient
material for calculating r and $\sigma_y$. However, we know
from formula (14A) that

$$r = \frac{S(XY) - N\bar{X}\bar{Y}}{N\sigma_x\sigma_y}.$$

Hence

$$a = r\frac{\sigma_y}{\sigma_x} = \frac{S(XY) - N\bar{X}\bar{Y}}{N\sigma_x^2}.$$

This expression may be calculated directly from the data,
since we know the exact values of x, the independent variate.
The method of calculation is simple and is illustrated by the
following example.

Example 35. The same test was given 10 times to a group of
subjects and the mean scores obtained were as follows:

  Test          1     2     3      4      5     6      7     8     9      10
  Mean score   11    14    17.5   20.5   23    25.5   28    32    35.5   37.5


A graph of these means suggests that they lie roughly on a
straight line. The line will be the regression of test score on
order of testing so that we will call test order x, the independent
variate, and mean score y, the dependent variate. We first
calculate the coefficients a and b in the equation y = ax + b and
then proceed to examine how well a line with this equation fits
the data.

Five columns of figures are constructed as in product-moment
correlation. These are headed x, y, x², y² and xy
respectively and each column is summed. In order to reduce
the arithmetic, the y entries may be taken from an arbitrary
origin, say 25 in this case. The table below
shows the columns and totals obtained.

     x        y        x²        y²         xy
     1     −14.0        1      196.00     −14.0
     2     −11.0        4      121.00     −22.0
     3      −7.5        9       56.25     −22.5
     4      −4.5       16       20.25     −18.0
     5      −2.0       25        4.00     −10.0
     6       0.5       36        0.25       3.0
     7       3.0       49        9.00      21.0
     8       7.0       64       49.00      56.0
     9      10.5       81      110.25      94.5
    10      12.5      100      156.25     125.0

    55      −5.5      385      722.25     213.0

N = 10, S(x) = 55, S(y) = −5.5, S(x²) = 385, S(y²) = 722.25, S(xy) = 213.

Hence $\bar{x}$ = 5.5,

N$\bar{x}\bar{y}$ = $\bar{x}$ S(y) = 5.5 × (−5.5) = −30.25,

N$\bar{x}^2$ = $\bar{x}$ S(x) = 5.5 × 55 = 302.5,

$$a = \frac{S(xy) - N\bar{x}\bar{y}}{N\sigma_x^2} = \frac{213 + 30.25}{385 - 302.5} = 2.948,$$

$\bar{y}$ = −0.55,

b = $\bar{y}$ − a$\bar{x}$ = −0.55 − 2.948 × 5.5 = −16.764.


This value of b requires correction since y was taken from an
arbitrary origin. This is done by the addition of 25, so that the
true value of b is 25 − 16.764 = 8.236. Hence the straight line
which best fits the data is given by the equation
y = 2.948x + 8.236.

By substituting values of x from 1 to 10 in this equation we
may calculate the best fitting theoretical values of y and
compare these with the observed values, as under:

     x      Observed y     Calculated y
     1        11.0           11.184
     2        14.0           14.132
     3        17.5           17.080
     4        20.5           20.028
     5        23.0           22.976
     6        25.5           25.924
     7        28.0           28.872
     8        32.0           31.820
     9        35.5           34.768
    10        37.5           37.716
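The fitting of the straight line in Example 35 may be sketched in code as follows. The function name `fit_line` is hypothetical, and no arbitrary origin is used since the machine does not need the arithmetic shortened. Because a is not rounded to 2.948 before b is computed, b comes out at about 8.233 rather than the 8.236 of the text; the difference is one of rounding only.

```python
def fit_line(xs, ys):
    """Return (a, b) for the regression line y = a*x + b of y on x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # S(xy) - N * x_bar * y_bar, and S(x^2) - N * x_bar^2.
    s_xy = sum(x * y for x, y in zip(xs, ys)) - n * x_bar * y_bar
    s_xx = sum(x * x for x in xs) - n * x_bar * x_bar
    a = s_xy / s_xx
    b = y_bar - a * x_bar
    return a, b


# Data of Example 35: test order and mean score.
tests = list(range(1, 11))
scores = [11, 14, 17.5, 20.5, 23, 25.5, 28, 32, 35.5, 37.5]

a, b = fit_line(tests, scores)
calculated = [a * x + b for x in tests]
print(f"a = {a:.3f}, b = {b:.3f}")
```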


It is obvious by examination that the observed values lie
very closely about the calculated regression line. The significance
of the fit may be tested mathematically by an analysis
of the variance. (See Chapter xi.) For this purpose, the
variance of y is split into two parts, that due to linear regression
and that due to departures from linear regression. Since
we have only mean values of y, the variance of y can be estimated
with only N − 1 degrees of freedom, in this case 9. The
total sum of squares about the mean is given by

$$S(y^2) - \frac{[S(y)]^2}{N}.$$

The part of the variance due to linear regression is given by
multiplying the variance of x by a², i.e. it is given by

$$a^2 N\sigma_x^2 = a^2\left\{S(x^2) - N\bar{x}^2\right\}.$$

Subtraction of this from the total sum of squares gives the
part of the variance due to departures from linear regression.

We get in this example, therefore, the following analysis
of variance:

                                                                        Mean
                                  Sum of squares             D.F.     squares      V.R.
  1. Total                        722.25 − 3.025 = 719.225     9         —           —
  2. Due to linear regression     82.5 × 2.948² = 716.983      1      716.983     2560.7
  3. Due to departures from
     linear regression            2.242                        8        0.280

The sums of squares for lines 2 and 3 are divided by their
respective number of degrees of freedom (D.F.) to obtain the
mean squares or estimates of the variance. Finally the variance
ratio (V.R.) is obtained by dividing the estimated variance due
to linear regression by that due to departures from linear
regression, giving the ratio of 2560.7. From Fisher and Yates’s
tables (Ref. 5) we find that for $n_1$ of 1 and $n_2$ of 8 the critical
value of the variance ratio at the 1% point is 11.26. The
observed value of 2560.7 is greatly in excess of this, which
means that almost the whole of the variance of y is due to
linear regression and the part due to departures from linear
regression is negligible.
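The analysis of variance for Example 35 may be retraced as below. This is a sketch with a hypothetical function name. It should be noted that the sum of squares due to regression is here computed from the unrounded slope, giving about 717.22 instead of 716.98; since the remainder is small, this changes the variance ratio appreciably (to roughly 2860 instead of 2560.7), although the conclusion is the same because either value is vastly greater than the tabulated 11.26.

```python
def regression_anova(xs, ys):
    """Split the sum of squares of y into regression and departures from it."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xx = sum(x * x for x in xs) - n * x_bar * x_bar
    s_yy = sum(y * y for y in ys) - n * y_bar * y_bar
    s_xy = sum(x * y for x, y in zip(xs, ys)) - n * x_bar * y_bar
    regression_ss = s_xy ** 2 / s_xx          # equals a^2 * s_xx
    residual_ss = s_yy - regression_ss
    residual_ms = residual_ss / (n - 2)       # N - 2 degrees of freedom
    return {
        "total_ss": s_yy,
        "regression_ss": regression_ss,
        "residual_ss": residual_ss,
        "variance_ratio": regression_ss / residual_ms,
    }


tests = list(range(1, 11))
scores = [11, 14, 17.5, 20.5, 23, 25.5, 28, 32, 35.5, 37.5]
table = regression_anova(tests, scores)
for name, value in table.items():
    print(f"{name}: {value:.3f}")
```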

9.ix. Logarithmic curves. It is sometimes obvious that
a straight regression line will not be a good fit to observed data.

Example 36. A subject was given a sensori-motor test 12
consecutive times and his scores were:

2.0, 3.3, 4.0, 4.5, 4.7, 5.0, 5.5, 5.6, 5.9, 6.0, 6.1, 6.3.

If these scores are plotted against order of testing, a line is
obtained which is evidently not a straight line. The line rises
fairly steeply at first and gradually flattens out, which suggests
that some form of logarithmic curve might fit the data best.





The simplest forms of logarithmic curves are given by the
following equations:

y = a log x + b,
log y = a log x + b.


To discover which curve gives the best fit to the data it is
necessary to plot each one of them, giving x the values of 1
to 12, and selecting the one which gives an approximately
straight line. With the present data it will be found that
plotting y against log x gives an almost straight line, hence
we assume that the observed scores will lie very closely about
a curve whose equation is y = a log x + b. We have therefore
to calculate the constants a and b and then examine the
goodness of fit by performing an analysis of variance.

As in the previous example, we first construct and sum five
columns. The first column will be values of log x, which for
convenience we may head X. The whole calculation is given
below:

     x     X (= log x)      y        X²         y²        Xy
     1        0.00         2.0     0.0000      4.00     0.000
     2        0.30         3.3     0.0900     10.89     0.990
     3        0.48         4.0     0.2304     16.00     1.920
     4        0.60         4.5     0.3600     20.25     2.700
     5        0.70         4.7     0.4900     22.09     3.290
     6        0.78         5.0     0.6084     25.00     3.900
     7        0.85         5.5     0.7225     30.25     4.675
     8        0.90         5.6     0.8100     31.36     5.040
     9        0.95         5.9     0.9025     34.81     5.605
    10        1.00         6.0     1.0000     36.00     6.000
    11        1.04         6.1     1.0816     37.21     6.344
    12        1.08         6.3     1.1664     39.69     6.804

              8.68        58.9     7.4618    307.55    47.268

N = 12, S(X) = 8.68, S(y) = 58.9, S(X²) = 7.4618, S(y²) = 307.55, S(Xy) = 47.268.

$\bar{X}$ = 8.68/12 = 0.7233,

N$\bar{X}\bar{y}$ = 0.7233 × 58.9 = 42.6024,

N$\bar{X}^2$ = 0.7233 × 8.68 = 6.2782,

$$a = \frac{47.268 - 42.6024}{7.4618 - 6.2782} = 3.942,$$

$\bar{y}$ = 58.9/12 = 4.9083,

b = 4.9083 − 3.942 × 0.7233 = 2.0571.




It appears, therefore, that the original data are best fitted
by a curve which has the equation y = 3.942 log x + 2.0571.
Substituting values of x from 1 to 12 in this equation we obtain
the following comparison of observed and calculated values:

     x      Observed y     Calculated y
     1         2.0            2.057
     2         3.3            3.244
     3         4.0            3.938
     4         4.5            4.431
     5         4.7            4.813
     6         5.0            5.125
     7         5.5            5.388
     8         5.6            5.617
     9         5.9            5.819
    10         6.0            6.000
    11         6.1            6.162
    12         6.3            6.311

The fit appears to be very good.

Examining this mathematically, as in the previous example,
we obtain the following analysis of variance:

                                                                        Mean
                                  Sum of squares             D.F.     squares      V.R.
  1. Total                        307.55 − 289.10 = 18.45     11         —           —
  2. Due to linear regression     1.184 × 3.942² = 18.40       1       18.40       3680
  3. Due to departures from
     linear regression            0.05                        10        0.005

This is a highly significant fit since the critical value of the
variance ratio for $n_1$ = 1 and $n_2$ = 10 is 10.04 at the 1% point.
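The constants of the logarithmic curve in Example 36 can be recovered by fitting a straight line to y against log x. In the sketch below the logarithms are rounded to two decimal places, as in the worked table, so that the sums agree with those of the text; this rounding and the function name are assumptions made for the illustration, and with unrounded logarithms the constants would differ slightly.

```python
import math


def fit_line(xs, ys):
    """Return (a, b) for the regression line y = a*x + b of y on x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum(x * y for x, y in zip(xs, ys)) - n * x_bar * y_bar
    s_xx = sum(x * x for x in xs) - n * x_bar * x_bar
    a = s_xy / s_xx
    return a, y_bar - a * x_bar


# Scores of Example 36 for trials 1 to 12.
scores = [2.0, 3.3, 4.0, 4.5, 4.7, 5.0, 5.5, 5.6, 5.9, 6.0, 6.1, 6.3]
# X = log x to base 10, rounded to two decimals as in the worked table.
log_x = [round(math.log10(x), 2) for x in range(1, 13)]

a, b = fit_line(log_x, scores)
calculated = [a * big_x + b for big_x in log_x]
print(f"y = {a:.3f} log x + {b:.3f}")
```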

9.x. Polynomials. When a logarithmic curve which
yields a straight regression line cannot be found, it may be
necessary to fit a polynomial curve to observed data. Examples
of such curves are:

quadratic  y = ax² + bx + c,
cubic      y = ax³ + bx² + cx + d,
quartic    y = ax⁴ + bx³ + cx² + dx + e, etc.


There are various ways of fitting such curves (for example,
the method of least squares), but they all involve lengthy
algebraical computations and will not be described in this
book. (For a good account of fitting polynomials see Goulden,
Methods of Statistical Analysis, Chapter xiv, J. Wiley and
Sons, New York, 1947.)


EXERCISES ON CHAPTER IX 

30. (a) Construct contingency tables showing the relationship between
assessment I in Appendix E and each of the tests from A to G
inclusive. The tables should be 4 × 3 tables, using the four categories
of I given and dividing the subjects in each test into three groups as
nearly equal in size as possible according to their scores. Suitable
points of division for the groups in each test are as follows:

  A   (1) 39 and under:    (2) 40-49:      (3) 50 and over.
  B   (1) 164 and under:   (2) 165-194:    (3) 195 and over.
  C   (1) 19 and under:    (2) 20-29:      (3) 30 and over.
  D   (1) 24 and under:    (2) 25-35:      (3) 36 and over.
  E   (1) 115 and under:   (2) 116-136:    (3) 137 and over.
  F   (1) 24 and under:    (2) 25-30:      (3) 31 and over.
  G   (1) 33 and under:    (2) 34-40:      (3) 41 and over.

(b) Calculate the coefficient of contingency for each table
and determine in which tables there is a significant relationship.

31. (a) Construct 2 × 2 tables showing the relationship between
assessment I and each of the tests from A to G inclusive, and also
between assessment I and the order of merit H. In each case divide
the assessment I into (1) V.G. and G.: (2) F. and P., and the test scores
above and below their means as given in the answer to Exercise 3 (b).

(b) For each table calculate $\chi^2$, using formula (29), and determine
in which there is a significant relationship.

32. Investigate the association between test E and assessment I
by constructing a 2 × 4 table and calculating $\chi^2$ from formula (32).
Divide test E into two categories according as the subjects are above
or below the mean for the test, as given in the answer to Exercise
3 (b), and take assessment I in the four categories as given in
Appendix E.

33. In a certain experiment readings had to be made from a scale
which was graduated in tenths of an inch. A particular observer took
1000 readings and an analysis was made of the frequency of occurrence
of the final digits of the observations, those showing the tenth of an
inch for which each reading was taken. The frequency distribution of
these digits was as follows:

Digit      Frequency
0          161
1           79
2           96
3          109
4           60
5          185
6           67
7           98
8          110
9           56

Total     1000

There was no reason in the experiment for any particular final 
digits to occur more frequently than any others. Could this observer 
be regarded as reliable in reading the scale ? 

(The hypothesis to be examined here is that the observed frequency
of the digits does not depart significantly from that which would be
expected if the observer were completely reliable. On this hypothesis
each digit is equally likely to occur, so that the expected frequency is
the same for each, namely, 100. Calculate $\chi^2$ and test its significance
for 9 degrees of freedom.)


34. What is the equation of the straight line which best fits the
following serial readings?

No.       1   2   3     4     5    6    7    8
Reading   6   8   10½   13½   17   21   24   26

Examine the goodness of fit by the method of analysis of variance.



Chapter X

RANKING AND THE AGREEMENT OF JUDGES 


10.i. Paired comparisons: the coefficient of consistence.
Frequently in experimental work the investigator
wishes to rank a number of objects in order according to some
quality which cannot be directly measured. He might, for
instance, wish to obtain the order of preference of a number of
pictures as judged by certain observers. Now it is often a
matter of great difficulty for a judge to rank a whole group of
objects in order of preference, but usually he can state his
preference for one of a pair with ease. The method of paired
comparisons makes use of this latter consideration by presenting
the judge serially with all possible pairs of the objects
to be judged, so that each object is compared with each other
object separately. If there are n objects, then the number of
possible pairs is $\binom{n}{2}$, which is equal to $\tfrac{1}{2}n(n-1)$. It may be
seen that this number of pairs increases rapidly as n increases
so that there is a practical limit to the number of objects that
may be judged; the experiment would be too long and fatiguing
with a large value of n. This may be seen from the following
table showing the number of pairs for different values of n:


 n   Pairs      n   Pairs      n   Pairs      n   Pairs
 2     1        6    15       10    45       14    91
 3     3        7    21       11    55       15   105
 4     6        8    28       12    66       16   120
 5    10        9    36       13    78       17   136
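The growth in the number of pairs can be checked with a few lines of
Python. This is only an illustrative sketch; the function name
`number_of_pairs` is our own and does not come from the text.

```python
from math import comb


def number_of_pairs(n: int) -> int:
    """Number of distinct pairs among n objects, i.e. n(n - 1)/2."""
    return n * (n - 1) // 2


# Reproduce the table for n = 2 to 17 and cross-check against the
# binomial coefficient from the standard library.
for n in range(2, 18):
    assert number_of_pairs(n) == comb(n, 2)
    print(n, number_of_pairs(n))
```

With 8 objects a judge must make 28 comparisons; with 17 objects the
number has already risen to 136.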


From the results of the comparisons of the pairs of objects
a ranked order of preference of all the n objects may be
worked out, but before doing this an estimate of the consistency
of judgment exercised may be made. Suppose, for
example, that there are 8 objects which may be designated
A, B, C, D, E, F, G and H. Now if the judge prefers A to B and
B to C it follows that if he is consistent in his judgment he will
prefer A to C when that pair is presented to him. If, however,
he is inconsistent he might, for example, prefer A to B, B to C
and C to A, which may be symbolised as A > B > C > A. This
is known as a circular triad, and the number of such triads a
judge gives in the whole series of paired comparisons provides
an obvious measure of his degree of consistency. (It is possible
for a judge to give circular polyads, i.e. four or more objects
concerned, but such polyads may always be broken up into
triads.)

If n is the number of objects to be judged and d is the
number of circular triads shown in the whole series of comparisons,
then the coefficient of consistence, $\zeta$ (zeta), may be
calculated from one of the following expressions:

if n is odd,

$$\zeta = 1 - \frac{24d}{n^3 - n}, \qquad (35)$$

if n is even,

$$\zeta = 1 - \frac{24d}{n^3 - 4n}. \qquad (35\text{A})$$


The next point is the method of determining d. This may be
done by inspection of the series, but the method is laborious
and liable to error with a long series of comparisons. The best
method of finding d is as follows:

Example 37. First construct a paired comparisons table as
in Table VIII. This is a square table with rows and columns
headed A, B, C, ..., n and has a diagonal line drawn from the
cell AA to the cell nn. In the table below n = 8.

With 8 objects, 28 comparisons have to be made, and the
result of each comparison is entered on the table in the form of
an X. An X in any cell indicates that the object at the head of
the column is preferred to the object at the head of the row
containing that cell. The number of X's in each column is
counted and the totals entered at the foot of the table: the
sum of these equals $\tfrac{1}{2}n(n-1)$, i.e. with 8 objects the sum of
the totals is 28.

Table VIII. Paired comparisons table

Column   A   B   C   D   E   F   G   H
Total    7   6   3   5   3   2   2   0      Sum = 28

Now if there were no inconsistencies at all, the numbers at
the foot of the table would all be different and would run from
0 to 7 and would give the ranked order of preference. This is
not the case in Table VIII for there are two 3's and two 2's
and 4 and 1 are missing. It follows, therefore, that there were
some inconsistencies of judgment. The exact number of
triads may be found by calculating

$$d = \frac{n(n-1)(2n-1)}{12} - \frac{1}{2}\Sigma(a^2),$$

where $a$ is any column total.

In the example, with n = 8,

$$d = \frac{8 \times 7 \times 15}{12} - \tfrac{1}{2}(7^2 + 6^2 + 3^2 + 5^2 + 3^2 + 2^2 + 2^2)
    = 70 - 68 = 2.$$

Inspection of Table VIII shows that these two triads are
F > C > E > F and F > C > G > F. For this particular judge,
d = 2 and n = 8, so that his coefficient of consistence as given
by formula (35A) is

$$\zeta = 1 - \frac{24 \times 2}{512 - 32} = 0.90.$$

We may regard him, therefore, as having a high degree of
consistency and his preference ranking of the 8 objects may
be written as A B D C E F G H, C and E being assessed as equal
and also F and G.
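The calculation of d and of the coefficient of consistence may be
sketched in Python as follows. The function names are our own; the
column totals are those at the foot of Table VIII (7, 6, 3, 5, 3, 2, 2, 0
for objects A to H), and the formulas are (35) and (35A) together
with the expression for d in terms of the column totals.

```python
def circular_triads(column_totals):
    """d = n(n-1)(2n-1)/12 minus half the sum of squared column totals."""
    n = len(column_totals)
    return n * (n - 1) * (2 * n - 1) / 12 - sum(a * a for a in column_totals) / 2


def coefficient_of_consistence(n, d):
    """Formula (35) when n is odd, formula (35A) when n is even."""
    denominator = n**3 - n if n % 2 else n**3 - 4 * n
    return 1 - 24 * d / denominator


totals = [7, 6, 3, 5, 3, 2, 2, 0]  # column totals for objects A to H
d = circular_triads(totals)
zeta = coefficient_of_consistence(len(totals), d)
print(d, round(zeta, 2))
```

For these totals the sketch gives d = 2 and a coefficient of 0.90.
Because the column totals need not be whole numbers, the same
function can be used when halves have been entered for tied
preferences.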




10.ii. Significance of $\zeta$. The following table, calculated
by J. W. Whitfield, gives the maximum values of d for significance
for values of n from 6 to 20:

 n     d       n     d       n      d
 6     0      11    30      16    121
 7     3      12    42      17    149
 8     8      13    57      18    181
 9    13      14    75      19    217
10    20      15    96      20    258

For $\zeta$ to be significant, d should be not greater than the value
given in this table against the appropriate value of n. For
values of n below 6, significance cannot be proved.


lO.iii. Tied preferences. It often happens in practice 
that one or more pairs in a paired comparisons experiment are 
judged equal. There are various reasons why this may be so. 
A pair of objects may actually be equal in the quality being 
judged, or the judge may think them equal, or the judge may 
be incapable of making the discriminatory judgment. In the 
last instance he would probably show many inconsistencies 
in any case. Such ties are a trouble theoretically but from the 
experimental point of view they must be allowed. 


Example 38. Table IX shows the method of dealing with
such ties when they occur. This is an example where n = 7.
Here A and C had equal preference, so the figure ½ is written
in both the AC and the CA cells. The same is done for the other
two equal pairs, AF and DG. From the column totals the
ranked order is C B A D G E F. In this case d is 5, as calculated
by the formula for d given in 10.i. Hence by formula (35)

$$\zeta = 1 - \frac{24 \times 5}{343 - 7} = 0.643.$$

Since d is greater than 3, this value is not significant, so that
the particular judge concerned would be regarded as
inconsistent.


Table IX

10.iv. The case of m judges: coefficient of agreement.
The application of the method of paired comparisons by a
single judge is chiefly of use when we know on other grounds
the true ranked order of the objects judged. For example, a
person might profess to be able to assess intelligence from 
photographs. We could examine his ability to do so by photo¬ 
graphing a number of people whose intelligence (as measured 
by intelligence tests) we knew and then getting the judge to 
rank the photographs for intelligence by the paired com¬ 
parisons method. The order he chose could then be correlated 
with the previously known order from tests by the ranking 
method. (See Chapter vii.) 

In practice it often occurs that we cannot directly measure
the quality which we want to rank and we have to rely on
personal judgments for this ranking. We might, for instance,
wish to rank a number of people for the quality of initiative.
In such a case as this we should be unwilling to rely on the
preference of a single judge. We should get as many competent
judges as possible and explain to them carefully what we mean
by initiative before getting them each to rank the subjects
separately by the paired comparisons method.




Suppose we had n people to be assessed and m judges. We
should first examine the consistency of each judge by the
method described previously. This would involve making m
tables of paired comparisons and calculating the coefficient of
consistence for each. In practice we should then probably
omit any judge or judges who were very inconsistent since
their judgments would not be reliable. Let us assume here
that all m judges showed a high degree of consistency. The
next point to be decided is whether or not the judges agree
with one another. This may be examined by calculating the
coefficient of agreement, u.

To do this we first construct a summed paired comparisons
table. This is done by drawing up an n x n table and entering
in each cell the sum of the entries in the m separate tables.
For example, count the number of crosses and ½'s in all the
m AB cells in the separate tables and write the total in the AB
cell of the summed table, and so on. The columns are summed
as usual and the total of these sums is $\tfrac{1}{2}mn(n-1)$.

Table X. Summed paired comparisons table: n = 6, m = 5

Total of column sums = 75 = $\tfrac{1}{2}mn(n-1)$
$\Sigma$ = 92




Example 39. Table X shows a summed paired comparisons
table for 6 objects judged by 5 observers, so that n = 6 and
m = 5. We shall deal first with the case where there are no ½'s
in the summed table in order to illustrate two methods of
calculation. The number written in the top right-hand corner
of each cell is the total of the entries in the 5 corresponding
cells in the separate tables (which are not reproduced). Let
this entry for any cell be symbolised by $\gamma$ (gamma). Then for
each cell we calculate $\tfrac{1}{2}\gamma(\gamma - 1)$ and write the answer in the
bottom left-hand corner of each cell. If we add all these values
together we obtain $\Sigma\binom{\gamma}{2}$, which we may call $\Sigma$. Then u, the
coefficient of agreement, is given by the formula

$$u = \frac{8\Sigma}{mn(m-1)(n-1)} - 1. \qquad (36)^*$$

From the data in Table X we obtain $\Sigma$ = 92. Hence in this
example

$$u = \frac{8 \times 92}{5 \times 6 \times 4 \times 5} - 1 = 0.227.$$

There is an alternative method of calculating $\Sigma$ which may
be preferred and which has to be used in any case when there
are ½'s in the summed table. Consider only the portion of the
table below the diagonal. Sum the $\gamma$'s in this half to obtain
$\Sigma(\gamma)$. Then square each $\gamma$ and sum to obtain $\Sigma(\gamma^2)$. $\Sigma$ is then
given by the formula

$$\Sigma = \Sigma(\gamma^2) - m\Sigma(\gamma) + \binom{m}{2}\binom{n}{2}. \qquad (37)$$

In the example in Table X, we find $\Sigma(\gamma)$ = 43 and $\Sigma(\gamma^2)$ = 157.
Hence, from formula (37),

$$\Sigma = 157 - 5 \times 43 + 10 \times 15 = 92.$$

Note that the same result may be obtained as a check by
confining ourselves to the portion of the table above the
diagonal. In this case we find $\Sigma(\gamma)$ = 32 and $\Sigma(\gamma^2)$ = 102.
Hence

$$\Sigma = 102 - 5 \times 32 + 10 \times 15 = 92.$$

In practice, apart from checking, one would work with the
half of the table where the numbers were smaller.
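The two routes to the quantity called $\Sigma$, and the coefficient of
agreement itself, may be sketched in Python as below. The function
names are our own. Since the body of Table X is not reproduced here,
the half-table totals quoted in the text (43 and 157, or 32 and 102)
are used directly, and the small three-object table at the end is
purely hypothetical, included only to show that the cell-by-cell
method and formula (37) agree.

```python
from math import comb


def sigma_from_cells(gammas):
    """Sum of gamma(gamma - 1)/2 over every off-diagonal cell of the summed table."""
    return sum(g * (g - 1) / 2 for g in gammas)


def sigma_from_half_table(sum_gamma, sum_gamma_sq, m, n):
    """Formula (37), using the cells on one side of the diagonal only."""
    return sum_gamma_sq - m * sum_gamma + comb(m, 2) * comb(n, 2)


def coefficient_of_agreement(sigma, m, n):
    """Coefficient of agreement: u = 8*Sigma / (m n (m-1)(n-1)) - 1."""
    return 8 * sigma / (m * n * (m - 1) * (n - 1)) - 1


# Example 39: n = 6 objects, m = 5 judges.
sigma = sigma_from_half_table(43, 157, m=5, n=6)
u = coefficient_of_agreement(sigma, m=5, n=6)
print(sigma, round(u, 3))

# Hypothetical 3-object table for 5 judges: the cells below the diagonal,
# and their complements (m - gamma) above it.
below = [3, 1, 4]
above = [5 - g for g in below]
assert sigma_from_cells(below + above) == sigma_from_half_table(
    sum(below), sum(g * g for g in below), m=5, n=3
)
```

Either half of the table leads to $\Sigma$ = 92 for Example 39, and hence
to u = 0.227.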

* See Appendix H for a table to help in the calculation of u.

10.v. Significance of u. The significance of u may be
tested by calculating $\chi^2$. Making a correction for continuity
(similar to Yates's correction), we have in this case the following
expression for $\chi^2$:

$$\chi^2 = \frac{4}{m-2}\left\{\Sigma - 1 - \frac{1}{2}\binom{n}{2}\binom{m}{2}\frac{m-3}{m-2}\right\}. \qquad (38)^*$$

The number of degrees of freedom, $\nu$ (nu), corresponding to
this value of $\chi^2$ is given by

$$\nu = \binom{n}{2}\frac{m(m-1)}{(m-2)^2}. \qquad (39)^*$$

The significance of $\chi^2$ is then examined by calculating
$\sqrt{2\chi^2} - \sqrt{2\nu - 1}$. If this expression is greater than 1.65, $\chi^2$
and consequently u is significant.

In the example in 10.iv we have n = 6, m = 5 and $\Sigma$ = 92.
From formula (38)

$$\chi^2 = \tfrac{4}{3}\left\{92 - 1 - \tfrac{1}{2}(15 \times 10)\tfrac{2}{3}\right\} = 54\tfrac{2}{3}.$$

From formula (39)

$$\nu = \frac{15 \times 5 \times 4}{3 \times 3} = 33\tfrac{1}{3}.$$

Hence

$$\sqrt{2\chi^2} - \sqrt{2\nu - 1} = \sqrt{109.33} - \sqrt{65.67} = 10.46 - 8.10 = 2.36.$$

This is greater than 1.65, therefore u is significant.

From the table of the normal curve (see Ref. 1, Appendix,
Table 2), it may be seen that the probability of getting a value
of $\chi^2$ as large or larger than that obtained is P = 0.009.

(Note. For exact probabilities for values of m from 3 to 6
inclusive and small values of n, see Kendall, vol. i, Tables
16.11-16.14, Ref. 6.)
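The test of significance of u may be sketched in Python as follows,
assuming formulas (38) and (39) in the form
$\chi^2 = \frac{4}{m-2}\{\Sigma - 1 - \frac{1}{2}\binom{n}{2}\binom{m}{2}\frac{m-3}{m-2}\}$ and
$\nu = \binom{n}{2}m(m-1)/(m-2)^2$. The function names are our own, and
the sketch requires m to be greater than 2.

```python
from math import comb, sqrt


def chi_square_for_u(sigma, m, n):
    """Formula (38), including the continuity correction (Sigma - 1)."""
    expected = 0.5 * comb(n, 2) * comb(m, 2) * (m - 3) / (m - 2)
    return 4 / (m - 2) * (sigma - 1 - expected)


def degrees_of_freedom_for_u(m, n):
    """Formula (39)."""
    return comb(n, 2) * m * (m - 1) / (m - 2) ** 2


def normal_deviate(chi_sq, nu):
    """sqrt(2 chi^2) - sqrt(2 nu - 1); values above 1.65 are taken as significant."""
    return sqrt(2 * chi_sq) - sqrt(2 * nu - 1)


# Example from 10.iv: n = 6, m = 5, Sigma = 92.
chi_sq = chi_square_for_u(92, m=5, n=6)
nu = degrees_of_freedom_for_u(m=5, n=6)
deviate = normal_deviate(chi_sq, nu)
print(round(chi_sq, 2), round(nu, 2), deviate > 1.65)
```

Worked without intermediate rounding the deviate comes out slightly
below the 2.36 obtained by hand, but it remains well above 1.65, so
the conclusion is unchanged.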

* See Appendix I for values of $\nu$ and Appendix H for a table to help
in the calculation of $\chi^2$.

10.vi. Combination of several rankings. Suppose we
have obtained rankings of n objects by m judges, either
directly or by the method of paired comparisons, and we wish
to combine these separate rankings into a single ranking which
may be taken as the best representation of the consensus of
judgment. The simplest and best way of doing this is to sum
the m rankings for each object and re-rank the n totals thus
obtained.

Example 40. Suppose 3 judges ranked 8 objects as follows:

                 A    B    C    D    E    F    G    H
Judge 1          4    2    1    7    6    3    5    8
Judge 2          7    2    1    6    4    5    3    8
Judge 3          7    4    2    6    5    3    1    8

Sum             18    8    4   19   15   11    9   24
Combined rank    6    2    1    7    5    4    3    8

The last row is the combined rank order obtained by ranking
the sums in the row above.
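The combination of rankings by summing and re-ranking may be sketched
in Python as below. The function name is our own. The rankings used
are those of Example 40, read so that each judge's row is a
permutation of 1 to 8 and the column sums total 108, as the later
calculation of W requires. The sketch assumes that no two sums are
equal; ties would need the further rules discussed in the text.

```python
def combine_rankings(rankings):
    """Sum the ranks given to each object, then rank the sums (1 = smallest sum).

    Assumes that no two sums are equal.
    """
    sums = [sum(column) for column in zip(*rankings)]
    order = sorted(range(len(sums)), key=lambda i: sums[i])
    combined = [0] * len(sums)
    for rank, i in enumerate(order, start=1):
        combined[i] = rank
    return sums, combined


judges = [
    [4, 2, 1, 7, 6, 3, 5, 8],  # Judge 1, objects A to H
    [7, 2, 1, 6, 4, 5, 3, 8],  # Judge 2
    [7, 4, 2, 6, 5, 3, 1, 8],  # Judge 3
]
sums, combined = combine_rankings(judges)
print(sums)
print(combined)
```

The sums are 18, 8, 4, 19, 15, 11, 9 and 24, and the combined ranking
is 6, 2, 1, 7, 5, 4, 3, 8.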

Sometimes the sums of rankings for two objects may be
equal, for example 1 + 1 + 3 = 5 and 2 + 2 + 1 = 5. These two
objects would be tied in the combined ranking, but if for a
particular purpose we wished to avoid ties, we should take
as the smaller of the pair the one which gave the smaller sum
of squares of its constituents. Thus the sum of the squares of
1, 1 and 3 is equal to 11, whilst the sum of the squares of 2,
2 and 1 is equal to 9, so that we should take the latter as of
lesser rank than the former. If, however, there were tied
rankings in the separate judgments, we should have to allow
ties in the combined ranking.

10.vii. Coefficient of concordance. The degree of resemblance
between m rankings of the same n objects may
be estimated by calculating the coefficient of concordance, W.
To do this, first sum the m ranks for each object as in 10.vi.
The total of these sums is $\tfrac{1}{2}mn(n+1)$ and their mean is
$\tfrac{1}{2}m(n+1)$. Take the deviation of each sum from this mean
and square it. Add the squares to obtain a total sum of squares
which we will call S. S is then the variance of the sums of
rankings. Now if the concordance amongst the judges were
perfect, these sums would be m, 2m, 3m, ..., nm, and their
variance would be $\tfrac{1}{12}m^2(n^3 - n)$. The coefficient W is then
given by the ratio of the observed variance S to the variance
for perfect concordance, i.e.

$$W = \frac{12S}{m^2(n^3 - n)}. \qquad (40)$$

In Example 40, m = 3 and n = 8. The sums of ranks were
18, 8, 4, 19, 15, 11, 9 and 24. These total to 108 with a mean of
13½. Then S is equal to

$$(18 - 13\tfrac{1}{2})^2 + (8 - 13\tfrac{1}{2})^2 + \ldots + (24 - 13\tfrac{1}{2})^2.$$

The total of this is 310. Hence, from formula (40),

$$W = \frac{12 \times 310}{9(512 - 8)} = 0.82.$$

W varies between 0 and 1 and is unity only when all the
rankings are identical.

W is related to the average Spearman $\rho$ between all possible
pairs of rankings, the relationship being given by

$$\rho_{\text{av.}} = \frac{mW - 1}{m - 1}.$$
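The calculation of W for untied rankings may be sketched in Python as
follows. The function names are our own; the formula used is
W = 12S / m²(n³ − n), with the rankings of Example 40 taken so that
each row is a permutation of 1 to 8.

```python
def coefficient_of_concordance(rankings):
    """Return (S, W) for m untied rankings of the same n objects."""
    m = len(rankings)
    n = len(rankings[0])
    sums = [sum(column) for column in zip(*rankings)]
    mean = m * (n + 1) / 2
    s = sum((total - mean) ** 2 for total in sums)
    return s, 12 * s / (m**2 * (n**3 - n))


def average_spearman_rho(w, m):
    """Average Spearman rho over all pairs of rankings: (mW - 1)/(m - 1)."""
    return (m * w - 1) / (m - 1)


judges = [
    [4, 2, 1, 7, 6, 3, 5, 8],
    [7, 2, 1, 6, 4, 5, 3, 8],
    [7, 4, 2, 6, 5, 3, 1, 8],
]
s, w = coefficient_of_concordance(judges)
print(s, round(w, 2), round(average_spearman_rho(w, len(judges)), 2))
```

For these rankings S = 310 and W = 0.82; identical rankings would
give W = 1.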


Formula (40) holds only when there are no tied ranks. When
ties are present it needs modification. Consider any one row
of rankings. If there are t ranks tied in any one set, calculate
$\tfrac{1}{12}(t^3 - t)$. Do this for each set of ties and add for the whole row,
obtaining $T = \tfrac{1}{12}\Sigma(t^3 - t)$. Call the total for the first row $T_1$, for
the second $T_2$ and so on. Then if there are m rows, add the m
values of T to obtain $\Sigma(T)$. Then the modified form of formula
(40) becomes

$$W = \frac{S}{\tfrac{1}{12}m^2(n^3 - n) - m\Sigma(T)}.^* \qquad (40\text{A})$$

The following small table is of help in calculating the T's:

t               2    3    4    5     6
(t³ - t)/12     ½    2    5    10    17½

* See the Appendix for a table of $(n^3 - n)/12$.

Example 41. Calculate the value of W for the following
data, showing 3 rankings of the same 10 objects:

Object       A    B     C     D     E     F     G     H     J     K
Ranking 1    1    2     3     4½    4½    6     7½    7½    9     10
Ranking 2    1    2½    2½    4½    4½    6½    6½    8     9½    9½
Ranking 3    1    2     4½    4½    4½    4½    8     8     8     10

Sums         3    6½    10    13½   13½   17    22    23½   26½   29½

The sums total 165 with a mean of 16½. Subtracting 16½ from
each sum, squaring and adding we get S = 691.

Now in the first row there are 2 ties with 2 members each,
so that $T_1 = \tfrac{1}{2} + \tfrac{1}{2} = 1$. Similarly, $T_2 = 2$ and $T_3 = 7$. Therefore
$\Sigma(T) = 10$. Then, from formula (40A),

$$W = \frac{691}{\tfrac{1}{12} \times 9(1000 - 10) - 3 \times 10} = 0.970.$$
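The correction for ties may be sketched in Python as below. The
function names are our own. The sketch assumes the modified formula
in the form W = S / {m²(n³ − n)/12 − mΣ(T)}, with T for each ranking
equal to the sum of (t³ − t)/12 over its sets of tied ranks, and it
uses the three rankings of Example 41 with halves written as decimals.

```python
from collections import Counter


def tie_correction(ranking):
    """T for one ranking: sum of (t^3 - t)/12 over each set of t tied ranks."""
    return sum((t**3 - t) / 12 for t in Counter(ranking).values())


def concordance_with_ties(rankings):
    """Return (S, sum of T, W) using the tie-corrected formula."""
    m = len(rankings)
    n = len(rankings[0])
    sums = [sum(column) for column in zip(*rankings)]
    mean = m * (n + 1) / 2
    s = sum((total - mean) ** 2 for total in sums)
    total_t = sum(tie_correction(r) for r in rankings)
    w = s / (m**2 * (n**3 - n) / 12 - m * total_t)
    return s, total_t, w


rankings = [
    [1, 2, 3, 4.5, 4.5, 6, 7.5, 7.5, 9, 10],
    [1, 2.5, 2.5, 4.5, 4.5, 6.5, 6.5, 8, 9.5, 9.5],
    [1, 2, 4.5, 4.5, 4.5, 4.5, 8, 8, 8, 10],
]
s, total_t, w = concordance_with_ties(rankings)
print(s, total_t, round(w, 3))
```

Untied ranks contribute nothing to T, since t = 1 gives (t³ − t)/12 = 0,
so the same function reduces to formula (40) when no ties are present.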

10.viii. Significance of W. The significance of W may
be tested as for Fisher’s z (Section S.iii) by writing 



$$z = \tfrac{1}{2}\log_e \frac{(m-1)W}{1-W}. \qquad (41)$$

The appropriate degrees of freedom for consulting the z table
(Fisher and Yates, Ref. 5, Table V) are

$$n_1 = (n-1) - \frac{2}{m}, \qquad n_2 = (m-1)\left\{(n-1) - \frac{2}{m}\right\}.$$

When ties are present, the same test may be used provided
$\Sigma(T)$ is small compared with $\tfrac{1}{12}m(n^3 - n)$. If a large part of
$\Sigma(T)$ is provided by the tied ranks from one judge, we might
regard him as being relatively incapable of making the judgments
and so omit his rankings. Great care has to be exercised,
however, in omitting any experimental data; full justification
has to be shown. A number of judges may have a high degree
of concordance in their judgments and yet be wrong. An
outstanding case might disagree with the others and yet be
right.

An alternative form of formula (41) using logs to the base
10 instead of natural logarithms is

$$z = 1.1513[\log (m-1)W - \log (1 - W)]. \qquad (41\text{A})$$



Example 42. Examine the significance of the value of W
obtained in Example 40. Here m = 3, n = 8, W = 0.82.

Hence

$$z = 1.1513[\log (2 \times 0.82) - \log (1 - 0.82)]$$
$$= 1.1513(0.2148 - \bar{1}.2553)$$
$$= 1.1513 \times 0.9595$$
$$= 1.1047.$$

$n_1 = 6\tfrac{1}{3}$ and $n_2 = 12\tfrac{2}{3}$. From Table V in Fisher and Yates
(Ref. 5) we find for $n_1 = 6$ and $n_2 = 13$ a value of z of 0.5350
for the 5% point of significance. The calculated value is much
greater than this and is therefore significant.

Using formula (41) we get

$$\frac{(m-1)W}{1-W} = \frac{1.64}{0.18} = 9.11,$$
$$\log_e 9.11 = 2.2094,$$
$$z = \tfrac{1}{2}\log_e 9.11 = 1.1047.$$
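The two forms of the z calculation, and the degrees of freedom, may
be sketched in Python as follows. The function names are our own. The
degrees of freedom are taken as $n_1 = (n-1) - 2/m$ and
$n_2 = (m-1)n_1$, which reproduce the values 6⅓ and 12⅔ quoted in
Example 42. The sketch requires W to be less than 1.

```python
from math import log, log10


def fisher_z_for_w(w, m):
    """Formula (41): z = (1/2) log_e[(m - 1)W / (1 - W)]."""
    return 0.5 * log((m - 1) * w / (1 - w))


def fisher_z_for_w_log10(w, m):
    """Formula (41A): the same quantity from logarithms to the base 10."""
    return 1.1513 * (log10((m - 1) * w) - log10(1 - w))


def degrees_of_freedom_for_w(m, n):
    """Degrees of freedom n1 and n2 for entering the z table."""
    n1 = (n - 1) - 2 / m
    return n1, (m - 1) * n1


z = fisher_z_for_w(0.82, m=3)
z10 = fisher_z_for_w_log10(0.82, m=3)
n1, n2 = degrees_of_freedom_for_w(m=3, n=8)
print(round(z, 4), round(z10, 4), round(n1, 2), round(n2, 2))
```

The constant 1.1513 is half of log_e 10, which is why the two forms
agree to the accuracy of the tables.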

(Note. The methods described in this chapter are mainly 
due to M. G. Kendall; for their derivation, etc., see his Rank 
Correlation Methods, Griffin and Co., 1948.) 

EXERCISES ON CHAPTER X 

35. In a paired comparisons experiment, a subject was required to
rank 7 photographs, labelled A to G, in order of the intelligence of
expression portrayed. All possible pairs were presented in random
order and the 21 judgments made were as follows (recorded here in
regular order for convenience):

A>B    B>D    C<G
A>C    B>E    D>E
A>D    B<F    D>F
A<E    B<G    D<G
A>F    C>D    E>F
A<G    C>E    E<G
B>C    C>F    F<G

Construct a paired comparisons table from these data.

(a) Calculate d, the number of circular triads, and find out from the
table which are the triads.

(b) Calculate the coefficient of consistence.

(c) What is the ranked order of the 7 photographs?

(d) What conclusions would you draw from these findings?



36. The experiment in the previous exercise was made on 6 subjects
and the results are shown in the summed paired comparisons table
below:

(a) Calculate u, the coefficient of agreement, and examine its
significance.

(b) What conclusions do you now draw about the photographs?

37. Ten men in a factory were ranked in order of their degree of
possession of initiative by each of 4 foremen. These were the rankings:


Workman 

1 

2 

3 

4 

5 

6 

7 

8 

9 

10 

Ranking 1 

4 

3 

9 

10 

1 

8 

2 

6 

5 

7 

Ranking 2 

4 

1 

10 

9 

2 

5 

3 

8 

6 

7 

Ranking 3 

7 

4 

8i 

8J 

H 

6 


6 

3 

10 

Ranking 4 

5 

2 

8 

9 

2 

7 

2 

10 

4 

6 


Calculate W, the coefficient of concordance between the rankings and 
examine its significance. 




Chapter XI

INTRODUCTION TO ANALYSIS OF VARIANCE 

11.i. Fundamental nature. The technique of analysis
of variance (which is really a misnomer, as may be seen later)
was developed by R. A. Fisher. It has since been so widely
used and elaborated that the literature on the subject is
enormous and discouraging to the elementary student. The 
object of this chapter is to explain the fundamental nature 
of the method and to give some simple illustrations of its 
use. 

Essentially, analysis of variance provides a test of the 
homogeneity of a set of data. By homogeneity we mean that 
all the observations could have been drawn from the same 
population, which we may call the parent population, this 
population being normally distributed and having a certain 
mean and variance. In any experiment there may be several 
factors at work, each of which may cause a certain amount of 
variability in the observations made. Thus if several varieties 
of potatoes are sown in different types of soil and treated with 
different sorts of manure, the variation in the various yields 
of tubers may be affected by variety of seed, by variety of 
soil or by variety of manure. Analysis of variance enables us 
to discover how much of the total variability is due to each 
factor and a comparison of these contributory amounts of 
variation provides a test of the homogeneity of the observa¬ 
tions. If the data are shown not to be homogeneous, then the 
factor or factors causing the heterogeneity may be isolated 
and the relative amounts of their effects discovered. It should 
be emphasised at the outset that for this analysis to be possible,
it is essential that the experiment be very carefully
planned so as to justify the assumptions on which the method
of variate analysis is based. These assumptions will be pointed
out later.

The method derives from the fact that variance is an
additive quantity. Suppose X and Y are two independent,
normally distributed variables. (By independent is meant
uncorrelated.) Then the variance of X is $\frac{1}{N}S(X - \bar{X})^2$ and
that of Y is $\frac{1}{N}S(Y - \bar{Y})^2$, where N is the number in a sample of
each variable. In a similar way, the variance of (X + Y) will be

$$\frac{1}{N}S\{(X + Y) - (\bar{X} + \bar{Y})\}^2 = \frac{1}{N}S(X + Y - \bar{X} - \bar{Y})^2.$$

Rearranging and expanding we have

$$\frac{1}{N}S\{(X - \bar{X}) + (Y - \bar{Y})\}^2$$
$$= \frac{1}{N}S\{(X - \bar{X})^2 + (Y - \bar{Y})^2 + 2(X - \bar{X})(Y - \bar{Y})\}$$
$$= \frac{1}{N}S(X - \bar{X})^2 + \frac{1}{N}S(Y - \bar{Y})^2 + \frac{2}{N}S(X - \bar{X})(Y - \bar{Y}).$$

The last term on the right is twice the co-variance of X and Y
and since the two variables are uncorrelated this must be
equal to zero. Hence the variance of

(X + Y) = (variance of X) + (variance of Y).

The same sort of reasoning may be applied to three or more
variables.
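The additive property may be verified numerically with a short Python
sketch. The data below are hypothetical, chosen so that the
co-variance of the two series is exactly zero; the function names are
our own, and the divisor N is used as in the expressions above.

```python
def variance(values):
    """Mean squared deviation from the mean (divisor N)."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / n


def covariance(xs, ys):
    """Mean product of deviations from the two means (divisor N)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n


# Hypothetical, exactly uncorrelated series.
xs = [1, 2, 3, 4]
ys = [1, -1, -1, 1]
totals = [x + y for x, y in zip(xs, ys)]

print(covariance(xs, ys))                  # zero for these data
print(variance(xs), variance(ys), variance(totals))
```

Here the variances of X and Y are 1.25 and 1, and the variance of
their sum is 2.25. With correlated data the product term does not
vanish and the variance of the sum differs from the sum of the
variances by twice the co-variance.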

If therefore we have a set of observations each of which may
be regarded as falling into several independent classifications,
the total variance may be analysed into several variances, one
for each of the separate classifications. Further, if the classes
are really independent and the data are homogeneous, then
the variance in each classification is an independent estimate
of the parent variance, and these separate estimates will all
be equal within the limits of sampling error. If they are not
equal, then the data are not homogeneous and we come to the
conclusion that the data in different classes are drawn from
different populations having different means. The important
point to remember in applying the test for homogeneity (how
this is done is explained below) is that three assumptions are
made:

(1) that the data are normally distributed;

(2) that the separate estimates of variance are independent;

(3) that the variance in each class is the same.

11.ii. One-way classification. In Section 5.iii we examined
the probability that two sample means came from
the same population, using the t test. If we had three or more
samples it would be possible to apply the t test to each pair
of means, but the homogeneity of the samples can be tested
more easily by the analysis of variance.

Example 43. Numbers of leaves were taken from each of
half a dozen trees and their lengths measured. The following
are the measurements in millimetres:

Tree   Length
1      82, 87, 86, 90, 81, 84
2      85, 84, 91, 92, 88
3      92, 90, 84, 86, 88, 93, 89, 90
4      80, 86, 87, 81, 82, 82
5      87, 86, 88, 90, 85, 86, 87
6      90, 86, 84, 85, 85, 86, 87, 84, 87

What we wish to know is: can all these leaves be regarded as
having come from the same species of tree?

Now if X represents the length of any leaf and we lump all
the samples together to form one group of number N, then
the variance of the whole group is given as usual by

$$\text{variance} = \frac{S(X - \bar{X})^2}{N - 1},$$

since we are estimating the variance of the parent population.

The expression $S(X - \bar{X})^2$ is the sum of the squares of
deviations from the mean and will be referred to for convenience
as s.o.s. This sum of squares may be analysed into
two parts, each one of which provides an independent estimate
of the parent variance.

Let $\bar{X}$ be the general mean of the whole group and $\bar{X}_s$ any
sample mean. Then the total s.o.s. may be written

$$S(X - \bar{X})^2 = S(X - \bar{X}_s + \bar{X}_s - \bar{X})^2.$$

Grouping and expanding the right-hand side we get

$$S\{(X - \bar{X}_s) + (\bar{X}_s - \bar{X})\}^2
  = S(X - \bar{X}_s)^2 + S(\bar{X}_s - \bar{X})^2 + 2S(X - \bar{X}_s)(\bar{X}_s - \bar{X}).$$

The product term on the right equals zero, hence we may write

$$S(X - \bar{X})^2 = S(X - \bar{X}_s)^2 + S(\bar{X}_s - \bar{X})^2.$$

The first part on the right is the s.o.s. of the observations in
each sample about their respective means: the second is the
s.o.s. of the sample means about the general mean. To
calculate these sums of squares we make use of the identity

$$S(X - \bar{X})^2 = S(X^2) - [S(X)]^2/N.$$

First we find S(X) and S(X²) for each sample. These may conveniently
be arranged with the sample means as shown in
Table XI.




Table XI

Sample    No. in sample, N_s    S(X)    S(X²)      Sample mean
1         6                      510     43,406    85
2         5                      440     38,770    88
3         8                      712     63,430    89
4         6                      498     41,374    83
5         7                      609     52,999    87
6         9                      774     66,592    86

Totals    41 = N                3543    306,571    86.41 = general mean

Note that for sample 1, S(X²) is given by

82² + 87² + 86² + 90² + 81² + 84² = 43,406,

and similarly for the other samples.


The total s.o.s. is then

306571 - 3543²/41
  = 306571 - 306167.05
  = 403.95.

The s.o.s. of the sample means about the general mean, i.e.
$S(\bar{X}_s - \bar{X})^2$, is obtained by squaring S(X) for each sample,
dividing each square by $N_s$, the number in that sample, adding
the six results and subtracting $[S(X)]^2/N$. Thus we have

$$S(\bar{X}_s - \bar{X})^2 = \frac{510^2}{6} + \frac{440^2}{5} + \frac{712^2}{8} + \frac{498^2}{6} + \frac{609^2}{7} + \frac{774^2}{9} - \frac{3543^2}{41}$$
$$= 306319 - 306167.05$$
$$= 151.95.$$

The s.o.s. of sample observations about sample means is
obtained by subtracting $[S(X)]^2/N_s$ from S(X²) for each sample
and adding the six results. We have, therefore,

S(X - X̄_s)² = (43406 - 510²/6) + (38770 - 440²/5)
             + (63430 - 712²/8) + (41374 - 498²/6)
             + (52999 - 609²/7) + (66592 - 774²/9)
             = 56 + 50 + 62 + 40 + 16 + 28
             = 252.

We now have the necessary sums of squares. To obtain the
estimates of variance these have to be divided by the respective
number of degrees of freedom (D.F.). The total number of
D.F. is 41 - 1 = 40. If there are P samples, the D.F. between
samples = P - 1 = 5 in this case. The D.F. from individuals in the
samples about their sample means = $S(N_s - 1)$ = N - P = 35.
The complete analysis of variance then takes the following
tabular form:

                   s.o.s.     D.F.    Mean square
Between samples    151.95      5      30.39
Within samples     252        35       7.20

Total              403.95     40      10.099


It will be noted that the first two columns of figures are
additive but not the last column, which gives the mean squares
or estimates of the parent variance. In other words, we have
actually analysed the total s.o.s. and not the variance. The
first two variances in the last column are independent estimates
and so may be compared. The variance calculated from
the total s.o.s. is not independent since it includes the other
two.



There are now two ways of testing the homogeneity of the
samples. First we may calculate

$$z = \tfrac{1}{2}\log_e\left\{\frac{S(\bar{X}_s - \bar{X})^2}{P - 1} \bigg/ \frac{S(X - \bar{X}_s)^2}{N - P}\right\}$$

and consult Table V in Fisher and Yates's Tables for $n_1 = 5$
and $n_2 = 35$. In this case z = 0.7200. From the table for
$n_1 = 5$ and $n_2 = 30$ we see a value of z of 0.4648 for P = 0.05.
Since the calculated z is much bigger than the one in the tables,
it means that the data depart significantly from homogeneity.

Alternatively, and more simply, we may calculate the
variance ratio, V.R., by dividing the variance between samples
by the variance within samples and again consulting Table V
in Fisher and Yates. In this case V.R. = 30.39/7.2 = 4.22. In
the table for $n_1 = 5$ and $n_2 = 30$ we find a value for the V.R.
of 2.53 for P = 0.05, again showing that the calculated V.R.
would occur by chance very much less often than 1 in 20 times.
Hence the data depart significantly from homogeneity.
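The whole of this analysis may be sketched in Python as follows. The
function name and the dictionary keys are our own. The leaf lengths
are those of Example 43, read so that the sample totals agree with
Table XI (510, 440, 712, 498, 609 and 774). The sums of squares are
obtained from the identity S(X²) − [S(X)]²/N, exactly as in the hand
calculation, and the within-samples figure is found by subtraction.

```python
from math import log


def one_way_anova(samples):
    """Analyse the total sum of squares into between- and within-sample parts."""
    values = [x for sample in samples for x in sample]
    n_total = len(values)
    correction = sum(values) ** 2 / n_total            # [S(X)]^2 / N
    total_sos = sum(x * x for x in values) - correction
    between_sos = sum(sum(s) ** 2 / len(s) for s in samples) - correction
    within_sos = total_sos - between_sos
    df_between = len(samples) - 1
    df_within = n_total - len(samples)
    variance_ratio = (between_sos / df_between) / (within_sos / df_within)
    return {
        "total_sos": total_sos,
        "between_sos": between_sos,
        "within_sos": within_sos,
        "df_between": df_between,
        "df_within": df_within,
        "variance_ratio": variance_ratio,
        "z": 0.5 * log(variance_ratio),
    }


trees = [
    [82, 87, 86, 90, 81, 84],
    [85, 84, 91, 92, 88],
    [92, 90, 84, 86, 88, 93, 89, 90],
    [80, 86, 87, 81, 82, 82],
    [87, 86, 88, 90, 85, 86, 87],
    [90, 86, 84, 85, 85, 86, 87, 84, 87],
]
for name, value in one_way_anova(trees).items():
    print(name, round(value, 3))
```

The variance ratio and z must still be compared with tabulated values
for the appropriate degrees of freedom; the sketch does not attempt
to reproduce the tables. Subtracting a working mean from every
observation before calling the function leaves all the sums of
squares unchanged.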

The above example has been worked out in full to show
exactly what is being done. In practice the arithmetic may be
reduced. First we take a working mean of, say, 85 and reduce
the original data to the following:

Tree   Length (from 85)
1      -3, 2, 1, 5, -4, -1
2      0, -1, 6, 7, 3
3      7, 5, -1, 1, 3, 8, 4, 5
4      -5, 1, 2, -4, -3, -3
5      2, 1, 3, 5, 0, 1, 2
6      5, 1, -1, 0, 0, 1, 2, -1, 2

From these data Table XI becomes Table XI A.


Table XI A

Sample    N_s       S(X)    S(X²)    Sample mean
1         6           0      56       0
2         5          15      95       3
3         8          32     190       4
4         6         -12      64      -2
5         7          14      44       2
6         9           9      37       1

Totals    41 = N     58     486

From this, total

s.o.s. = 486 - 58²/41 = 486 - 82.05 = 403.95.

For the sum of squares between samples we make use of the
identity

$$S(\bar{X}_s - \bar{X})^2 = S[S(X)\bar{X}_s] - [S(X)]^2/N.$$

In this case we have

s.o.s. between samples
  = (0 x 0) + (15 x 3) + (32 x 4) + (-12 x -2)
      + (14 x 2) + (9 x 1) - 58 x 58/41
  = 0 + 45 + 128 + 24 + 28 + 9 - 82.05
  = 151.95.

The s.o.s. of observations about their respective sample
means need not be calculated directly but may be obtained by
subtracting the s.o.s. between samples from the total s.o.s.
This may be entered in the analysis of variance table as the
'remainder'. The table then takes this form:

                   s.o.s.     D.F.    Mean square    V.R.
Between samples    151.95      5      30.39          4.22
Remainder          252        35       7.20          -

Total              403.95     40      10.099



11.iii. Interpretation of analysis. Having decided from our
analysis that the data are not homogeneous we have to ask
what this signifies. To begin with, it means that the six
samples are not all drawn from the same parent population.
The sample means vary from 83 to 89 and we may decide
fairly definitely that the sample with the smallest mean length
was taken from a different species of tree from that with the
largest mean length. How much further may we go? This may
be decided by calculating the standard error of the difference
between sample means. Now we have three estimates of
variance. If the data were homogeneous, the best estimate
would be the one based on the largest number of d.f., i.e. that
obtained from the total s.o.s. However, the data are not
homogeneous, so that the best estimate of the variance for
calculating the standard error is the ‘remainder’ variance,
since from this the variance due to differences between sample
means has been eliminated. This variance is 7.20, so that the
standard error of the mean of a sample of size N drawn from a
population of estimated variance 7.20 will be
$\sqrt{7.20/N} = 2.683/\sqrt{N}$.

The standard error of the difference between two means of
samples of $N_1$ and $N_2$ respectively will be

$$2.683\sqrt{1/N_1 + 1/N_2} \quad \text{(see Chapter v).}$$

The difference between the means of samples 3 and 4 in
the previous example is 89 − 83 = 6 and the numbers in the
samples are 8 and 6. Hence

$$\text{S.e. of the difference} = 2.683\sqrt{1/8 + 1/6} = 2.683 \times 0.54 = 1.449,$$

$$\text{Difference/S.e. of difference} = 6/1.449 = 4.14.$$

Normally we should infer from this, if we had only two
samples, that this difference between the means is definitely
significant. However, a little care is necessary in coming to
this conclusion. We have 6 samples so that we might compare
15 different pairs of means. Above we chose the biggest
difference so that we ought to apply a stricter test of significance
than that of P = 0.05. It seems reasonable in this case to
demand that the probability of a difference of the observed
size arising by chance should be 1 in 20 × 15, i.e. 1 in 300. This
corresponds roughly to a case where a difference is 2.9 times
its standard error, as may be seen from the table of the
probability integral. Hence we may take it in this case that any
difference between means that is 2.9 times its standard error,
or greater, is significant. Applying this test to all possible
pairs in the example we obtain the results shown in the table
below.
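The ratio of about 2.9 can be recovered without a printed table of the probability integral. The sketch below is one way of doing so, assuming (as the figures in the text imply) that the 1-in-20 level is divided by the number of pairs and that both tails of the normal curve are counted; the function name is illustrative.

```python
from statistics import NormalDist


def critical_ratio(n_pairs: int, base_level: float = 0.05) -> float:
    """Normal deviate exceeded (in either direction) with probability
    base_level / n_pairs, i.e. the stricter level used when the largest
    of n_pairs possible differences has been picked out."""
    level = base_level / n_pairs
    return NormalDist().inv_cdf(1 - level / 2)


ratio_for_15_pairs = critical_ratio(15)   # 1 in 300
print(f"15 pairs: difference must be about {ratio_for_15_pairs:.2f} times its s.e.")
```

This is simply a division of the significance level among the possible comparisons; it is a rough safeguard rather than an exact test.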

Samples    Difference   S.e. of diff.   Diff./S.e.
1 and 2        3           1.625           1.8
1 and 3        4           1.449           2.8
1 and 4        2           1.549           1.3
1 and 5        2           1.493           1.3
1 and 6        1           1.414           0.7
2 and 3        1           1.530           0.7
2 and 4        5           1.625           3.1   Significant
2 and 5        1           1.571           0.6
2 and 6        2           1.497           1.3
3 and 4        6           1.449           4.1   Significant
3 and 5        2           1.388           1.4
3 and 6        3           1.304           2.3
4 and 5        4           1.493           2.7
4 and 6        3           1.414           2.1
5 and 6        1           1.352           0.7

The order of the samples arranged according to mean value
is 3, 2, 5, 6, 1, 4. From the table it may be seen that only
samples 3 and 4 and 2 and 4 are significantly different. We
should suspect, therefore, that samples 2 and 3 came from a
different species of tree from sample 4 but which of the other
samples came from which species cannot be deduced from the
present data. There is an indication that with larger samples
the differences between the means of 3 and 6 and 3 and 1 might
appear as significant, and also that between 4 and 5. We
might suspect then that samples 3, 2 and 5 came from one
species of tree and samples 6, 1 and 4 from another, but we
could not prove this from the present data. For this we should
need larger samples or some other means of identification,
such as shape.
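The fifteen comparisons can be reproduced mechanically. The following is a minimal sketch: the sample means and sizes are inferred from the differences and totals quoted in this section (an assumption, since the raw leaf lengths are not repeated here), and 2.9 is the rough criterion adopted in the text.

```python
from itertools import combinations
from math import sqrt

# ASSUMPTION: means and sizes are inferred from the quoted differences and totals.
means = {1: 85, 2: 88, 3: 89, 4: 83, 5: 87, 6: 86}
sizes = {1: 6, 2: 5, 3: 8, 4: 6, 5: 7, 6: 9}
remainder_variance = 7.20   # 'remainder' mean square from the analysis table
threshold = 2.9             # criterion for 1 in 300, as adopted in the text


def compare(i, j):
    """Return (difference, s.e. of difference, ratio) for samples i and j."""
    diff = abs(means[i] - means[j])
    se = sqrt(remainder_variance * (1 / sizes[i] + 1 / sizes[j]))
    return diff, se, diff / se


significant = []
for i, j in combinations(sorted(means), 2):
    diff, se, ratio = compare(i, j)
    flag = "Significant" if ratio >= threshold else ""
    if flag:
        significant.append((i, j))
    print(f"{i} and {j}   {diff:2d}   {se:.3f}   {ratio:.1f}   {flag}")
```

Only the pairs 2 and 4 and 3 and 4 pass the criterion, which agrees with the conclusion drawn above.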

11.iv. Two-way classification. In the last example, 
the data all belonged to a single family, leaf-lengths. It 
frequently happens that a set of observations may be regarded 
as belonging to two families at the same time. 

Example 44. In an experiment on the effects of temperature
conditions on human performance, 8 practised subjects were
given a sensori-motor test in each of 4 temperature conditions.
Since the subjects were well practised, the order in which the
tests were done was unimportant. The tests were randomised
amongst the subjects, so that for each condition there were
equal numbers of first testing, second testing, third testing and
fourth testing. The scores in the tests are shown below.



Table XII

                              Subjects (A)
                   1    2    3    4    5    6    7    8   Totals
Temperatures  1   76   80   79   90   85  101   94   83     688
(B)           2   75   81   77   90   86   98   93   85     685
              3   76   78   76   91   82   98   92   83     676
              4   68   75   72   85   82   90   82   77     631
Totals           295  314  304  356  335  387  361  328    2680


Call the number of subjects $N_a$ and the number of conditions
$N_b$. In such a case as this the total s.o.s. may be analysed into
three parts:

(i) the s.o.s. of A means about the general mean, i.e.
$S(\bar{X}_a - \bar{X})^2$, with $N_a - 1$ D.F.;

(ii) the s.o.s. of B means about the general mean, i.e.
$S(\bar{X}_b - \bar{X})^2$, with $N_b - 1$ D.F.;

(iii) a remainder obtained by subtracting (i) and (ii) from
the total s.o.s. This has $(N_a - 1)(N_b - 1)$ D.F.

The total s.o.s., with $N - 1$ D.F., is calculated as in the
previous example by summing the squares of all the readings
and subtracting $[S(X)]^2/N$. For convenience and to reduce the
arithmetic the data in Table XII may be written down from
a working mean of 84 (since 2680/32 = 83.75). This yields
Table XIIA.

Table XIIA

                              Subjects (A)
                   1    2    3    4    5    6    7    8   Totals
Temperatures  1   -8   -4   -5    6    1   17   10   -1      16
(B)           2   -9   -3   -7    6    2   14    9    1      13
              3   -8   -6   -8    7   -2   14    8   -1       4
              4  -16   -9  -12    1   -2    6   -2   -7     -41
Totals           -41  -22  -32   20   -1   51   25   -8      -8


The total s.o.s.

$$\begin{aligned}
&= (-8)^2 + (-4)^2 + (-5)^2 + \dots + (-2)^2 + (-7)^2 - (-8)^2/32 \\
&= 2042 - 2 = 2040.
\end{aligned}$$

Great care should be taken with this part of the calculation
as there is no simple check on the arithmetic.



To obtain $S(\bar{X}_a - \bar{X})^2$, square and sum the 8 totals for the
A grouping, divide by 4 (since there are 4 readings in each),
and subtract $[S(X)]^2/N$.

Hence

$$\begin{aligned}
S(\bar{X}_a - \bar{X})^2 &= \tfrac{1}{4}[(-41)^2 + (-22)^2 + (-32)^2 + 20^2 + (-1)^2 + 51^2 + 25^2 + (-8)^2] - 2 \\
&= 6880/4 - 2 \\
&= 1718.
\end{aligned}$$

In a similar way, $S(\bar{X}_b - \bar{X})^2$ is obtained by squaring and
summing the 4 totals for the B groups, dividing by 8 and
subtracting $[S(X)]^2/N$. Hence

$$\begin{aligned}
S(\bar{X}_b - \bar{X})^2 &= \tfrac{1}{8}[16^2 + 13^2 + 4^2 + (-41)^2] - 2 \\
&= 2122/8 - 2 \\
&= 263.25.
\end{aligned}$$

We may now construct the analysis table as below, obtaining
the ‘remainder’ s.o.s. by subtraction.

                         s.o.s.    D.F.   Mean square   V.R.
Between subjects        1718         7      245.4       87.7
Between temperatures     263.25      3       87.75      31.4
Remainder                 58.75     21        2.80       —
Total                   2040        31       65.8

In this table the first three estimates of the parent variance,
given in the mean square column, are independent. The V.R.
is obtained in each case by calculating the ratio of each of the
first two variances to the remainder variance. Reference to Fisher
and Yates’s Table V shows that both these ratios are highly
significant, so that we reject the hypothesis that the data
are homogeneous. The variance between subjects is very large,
showing that there are highly significant differences in ability
between the subjects, but we are not interested in this. What
does interest us is the effect of temperature conditions. The
means for these conditions, obtained by dividing the B totals
in Table XII by 8, are 86, 85.625, 84.5 and 78.875. Since the
data are not homogeneous we take the remainder variance as
the best estimate of the parent variance, so that the standard
error of a mean of N individuals drawn at random will be
$\sqrt{2.80/N} = 1.67/\sqrt{N}$.

Now we have 6 possible pairs of means to compare, hence it
is reasonable to use a level of significance equal to 1 in 20 × 6,
i.e. 1 in 120. This corresponds to a ratio between a difference
and its standard error of about 2.6. Since there are 8 readings
in each temperature mean, the standard error of the difference
between the means of two samples is
$1.67\sqrt{1/8 + 1/8} = 0.84$. Hence means differing by more
than 2.6 × 0.84 = 2.2 will be significantly different,
i.e. the mean of the fourth temperature condition is
significantly different from the means of the other three conditions,
but these three do not differ significantly from one another.
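Since there is no simple check on this arithmetic by hand, a short program is a convenient safeguard. The sketch below recomputes the two-way analysis directly from the scores of Table XII (entered here so that the row and column totals agree with the printed totals); no working mean is needed, and all names are illustrative.

```python
# Two-way analysis of variance for Example 44.
# Rows: temperature conditions 1-4 (B); columns: subjects 1-8 (A).
scores = [
    [76, 80, 79, 90, 85, 101, 94, 83],
    [75, 81, 77, 90, 86, 98, 93, 85],
    [76, 78, 76, 91, 82, 98, 92, 83],
    [68, 75, 72, 85, 82, 90, 82, 77],
]
n_b, n_a = len(scores), len(scores[0])
N = n_a * n_b
correction = sum(map(sum, scores)) ** 2 / N          # [S(X)]^2 / N

total_sos = sum(x * x for row in scores for x in row) - correction
a_totals = [sum(col) for col in zip(*scores)]        # one total per subject
b_totals = [sum(row) for row in scores]              # one total per temperature
sos_a = sum(t * t for t in a_totals) / n_b - correction
sos_b = sum(t * t for t in b_totals) / n_a - correction
remainder_sos = total_sos - sos_a - sos_b            # obtained by subtraction

df_a, df_b = n_a - 1, n_b - 1
df_rem = df_a * df_b
ms_rem = remainder_sos / df_rem
vr_a = (sos_a / df_a) / ms_rem
vr_b = (sos_b / df_b) / ms_rem

print(f"Between subjects      {sos_a:9.2f} {df_a:3d} {sos_a / df_a:8.2f} {vr_a:6.1f}")
print(f"Between temperatures  {sos_b:9.2f} {df_b:3d} {sos_b / df_b:8.2f} {vr_b:6.1f}")
print(f"Remainder             {remainder_sos:9.2f} {df_rem:3d} {ms_rem:8.2f}")
print(f"Total                 {total_sos:9.2f} {N - 1:3d}")
```

Working from the raw scores rather than from deviations about 84 gives the same sums of squares, because a change of origin alters neither the totals' spread nor the correction-adjusted sums.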

11.v. Three-way classification. In the previous example,
the remainder s.o.s., which would be symbolised as
$S(X - \bar{X}_a - \bar{X}_b + \bar{X})^2$, is sometimes referred to as an
interaction. When we have a set of data which can be classified in
three ways, A, B and C, there will be three such terms, one for
the interaction of A and B, one for A and C and the third for
B and C. The total s.o.s. in such a case may therefore be
analysed into seven parts. First there will be three parts given
by $S(\bar{X}_a - \bar{X})^2$, $S(\bar{X}_b - \bar{X})^2$ and $S(\bar{X}_c - \bar{X})^2$; these show the
effects of A, B and C separately and are usually known as the
main effects. Next will be the three interaction terms mentioned
above. Interactions involving two main effects are
known as first-order interactions. Finally, the seventh part in
the analysis is a remainder term. This term is in fact the
interaction of all three main effects and is known as a second-order
interaction. The number of degrees of freedom involved in
interactions is always the product of the d.f. of the component
main effects, so that if the numbers of groupings in the classes
A, B and C are p, q and r respectively, the d.f. for interaction
AB will be $(p-1)(q-1)$, for AC $(p-1)(r-1)$ and for BC
$(q-1)(r-1)$, whilst for the second-order interaction ABC the
d.f. will be $(p-1)(q-1)(r-1)$. The total d.f. are $(pqr - 1)$
and the student may easily verify by algebra that this is equal
to the sum of the d.f. of the seven parts in the analysis.
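The algebraic identity can also be confirmed numerically for any particular p, q and r. The following is a small sketch; the function name and the trial values are illustrative only.

```python
def three_way_df(p: int, q: int, r: int) -> dict:
    """Degrees of freedom of the seven parts of a three-way analysis
    with p, q and r groupings in classes A, B and C."""
    return {
        "A": p - 1,
        "B": q - 1,
        "C": r - 1,
        "AB": (p - 1) * (q - 1),
        "AC": (p - 1) * (r - 1),
        "BC": (q - 1) * (r - 1),
        "ABC": (p - 1) * (q - 1) * (r - 1),
    }


for p, q, r in [(3, 3, 6), (2, 4, 5), (4, 4, 4)]:
    parts = three_way_df(p, q, r)
    # The seven parts together account for all pqr - 1 degrees of freedom.
    assert sum(parts.values()) == p * q * r - 1
    print((p, q, r), parts, "total", p * q * r - 1)
```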

" The calculation of the various sums of squares is similar 



in nature to that of the previous example but a little more
complicated. The method is fully illustrated in the following
example.

Example 45. In a certain psychological experiment the
apparatus consisted of a dial having a rotating needle and a
number of graduations round the circumference. The needle
could be rotated at 3 different speeds and the dial displayed
under 3 different intensities of illumination. Subjects in this
experiment were required to make a certain reaction each
time the needle reached a graduation on the dial. Since there
are nine combinations of speed and illumination, each subject
had 9 tasks to perform. The order of performance was
randomised for different subjects. Table XIII below gives the
number of correct reactions made in each of the 9 tasks by
6 different subjects. The object of the experiment was to
discover the relative effects of speed and intensity of illumination
on performance. (Note. This is a greatly simplified version
of an actual experiment and is given here solely for the purpose
of illustrating the analysis of variance with such data.)

Table XIII

                                Illuminations (A)
                     1                  2                  3
                 Speeds (B)         Speeds (B)         Speeds (B)
                 1    2    3        1    2    3        1    2    3
Subjects (C) 1   45   38   29       43   35   26       35   29   18
             2   38   33   20       40   32   21       34   25   19
             3   39   32   21       41   29   25       29   24   16
             4   43   37   24       39   30   25       30   27   14
             5   40   36   28       42   31   24       31   26   17
             6   40   35   25       40   32   22       32   26   16


This table gives a three-way classification as each entry
belongs to a certain subject (C) at a certain speed (B) under a
certain intensity of illumination (A). The first step in the
analysis is to find a convenient working mean and rewrite
Table XIII from this mean, inserting totals at the same time.
Since the range of readings is from 14 to 45, a convenient
working mean is 30. Rewriting the readings about this mean
we get Table XIIIA.



Table XIIIA

                                    Illuminations (A)
                       1                       2                       3
                   Speeds (B)              Speeds (B)              Speeds (B)
Subjects (C)    1    2    3  Total      1    2    3  Total      1    2    3  Total   Totals
     1         15    8   -1    22      13    5   -4    14       5   -1  -12    -8       28
     2          8    3  -10     1      10    2   -9     3       4   -5  -11   -12       -8
     3          9    2   -9     2      11   -1   -5     5      -1   -6  -14   -21      -14
     4         13    7   -6    14       9    0   -5     4       0   -3  -16   -19       -1
     5         10    6   -2    14      12    1   -6     7       1   -4  -13   -16        5
     6         10    5   -5    10      10    2   -8     4       2   -4  -14   -16       -2
  Totals       65   31  -33    63      65    9  -37    37      11  -23  -80   -92        8


The total s.o.s. may be obtained from this table in the usual
way, i.e. square each single entry (not the totals), add and
subtract $[S(X)]^2/N$. In this case

$$\begin{aligned}
\text{total s.o.s.} &= 15^2 + 8^2 + (-1)^2 + \dots + 2^2 + (-4)^2 + (-14)^2 - 8^2/54 \\
&= 3402 - 1.185 \\
&= 3400.815.
\end{aligned}$$

Main effects

(1) The totals for the A grouping are 63, 37 and −92, each
being the sum of 18 readings.

$$\begin{aligned}
\text{Hence}\quad S(\bar{X}_a - \bar{X})^2 &= [63^2 + 37^2 + (-92)^2]/18 - 8^2/54 \\
&= 13802/18 - 1.185 \\
&= 765.593.
\end{aligned}$$

(2) The totals for B are obtained by adding together the
three totals for $B_1$, $B_2$ and $B_3$ separately, i.e.

total $B_1$ =  65 + 65 + 11 =  141,
total $B_2$ =  31 +  9 − 23 =   17,
total $B_3$ = −33 − 37 − 80 = −150.

$$\begin{aligned}
\text{Hence}\quad S(\bar{X}_b - \bar{X})^2 &= [141^2 + 17^2 + (-150)^2]/18 - 8^2/54 \\
&= 42670/18 - 1.185 \\
&= 2369.370.
\end{aligned}$$

(3) The C totals are given at the right of Table XIIIA and
each contains 9 readings.

$$\begin{aligned}
\text{Hence}\quad S(\bar{X}_c - \bar{X})^2 &= [28^2 + (-8)^2 + (-14)^2 + (-1)^2 + 5^2 + (-2)^2]/9 - 8^2/54 \\
&= 1074/9 - 1.185 \\
&= 118.148.
\end{aligned}$$

First-order interactions

The interactions AB, AC and BC are best found by first
extracting three two-way tables from Table XIIIA.

(4) That for AB is shown in Table XIIIB.

Table XIIIB

           A1    A2    A3   Totals
B1         65    65    11     141
B2         31     9   -23      17
B3        -33   -37   -80    -150
Totals     63    37   -92       8


Now reference to the previous example shows that, to
calculate the s.o.s. for the remainder or interaction in a
two-way table, we calculated the total s.o.s. for the table and
subtracted from it the sum of the s.o.s. of the main effects.
Here we proceed likewise, bearing in mind that each reading in
Table XIIIB is the sum of 6 readings. Hence for Table XIIIB,

$$\begin{aligned}
\text{total s.o.s.} &= [65^2 + 65^2 + 11^2 + 31^2 + 9^2 + (-23)^2 + (-33)^2 + (-37)^2 + (-80)^2]/6 - 8^2/54 \\
&= 19000/6 - 1.185 \\
&= 3165.482.
\end{aligned}$$

We have already calculated the s.o.s. for the main effects A
and B, which are involved in this interaction, hence

$$\text{interaction s.o.s. for } AB = 3165.482 - 765.593 - 2369.370 = 30.519.$$

(5) Next we extract the two-way table for interaction AC,
each reading of which will be the sum of 3 readings. This is
shown in Table XIIIC.

Table XIIIC

           A1    A2    A3   Totals
C1         22    14    -8     28
C2          1     3   -12     -8
C3          2     5   -21    -14
C4         14     4   -19     -1
C5         14     7   -16      5
C6         10     4   -16     -2
Totals     63    37   -92      8


From this table, interaction s.o.s. for AC is

$$\begin{aligned}
&[22^2 + 14^2 + (-8)^2 + \dots + 10^2 + 4^2 + (-16)^2]/3 - 8^2/54 - 765.593 - 118.148 \\
&\quad = 2814/3 - 1.185 - 765.593 - 118.148 \\
&\quad = 53.074.
\end{aligned}$$

(6) Finally we have, for interaction BC (Table XIIID):

Table XIIID

           B1    B2    B3   Totals
C1         33    12   -17     28
C2         22     0   -30     -8
C3         19    -5   -28    -14
C4         22     4   -27     -1
C5         23     3   -21      5
C6         22     3   -27     -2
Totals    141    17  -150      8


This table is obtained by adding the three entries in
Table XIIIA for $C_1 B_1$, i.e. 15 + 13 + 5 = 33, and so on.

From Table XIIID, interaction s.o.s. for BC is

$$\begin{aligned}
&[33^2 + 12^2 + (-17)^2 + \dots + 22^2 + 3^2 + (-27)^2]/3 - 8^2/54 - 2369.370 - 118.148 \\
&\quad = 7506/3 - 1.185 - 2369.370 - 118.148 \\
&\quad = 13.297.
\end{aligned}$$

We may now complete our analysis table as under, obtaining
the remainder term by adding the s.o.s. for the three main
effects and the first-order interactions and subtracting from
the total s.o.s. The remainder variance is as usual used as the
denominator of the V.R.’s.

                               s.o.s.     D.F.   Mean squares   V.R.
Between illuminations (A)      765.593      2       382.80      150.7
Between speeds (B)            2369.370      2      1184.68      466.3
Between subjects (C)           118.148      5        23.63        9.3
Interaction AB                  30.519      4         7.63        3.0
Interaction AC                  53.074     10         5.31        2.09
Interaction BC                  13.297     10         1.33        0.52
Remainder                       50.814     20         2.54         —
Total                         3400.815     53
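The whole of this three-way analysis can be re-worked from the readings of Table XIII by one general routine, sketched below. The sum of squares of any set of totals is formed in the same way as in steps (1) to (6): square and add the totals, divide by the number of readings in each, and subtract the correction term. The data layout and function name are illustrative choices, not part of the original method.

```python
from itertools import product

# readings[c][a][b]: subject c, illumination a, speed b (Table XIII).
readings = [
    [[45, 38, 29], [43, 35, 26], [35, 29, 18]],
    [[38, 33, 20], [40, 32, 21], [34, 25, 19]],
    [[39, 32, 21], [41, 29, 25], [29, 24, 16]],
    [[43, 37, 24], [39, 30, 25], [30, 27, 14]],
    [[40, 36, 28], [42, 31, 24], [31, 26, 17]],
    [[40, 35, 25], [40, 32, 22], [32, 26, 16]],
]
n_c, n_a, n_b = 6, 3, 3
N = n_a * n_b * n_c
cells = {(a, b, c): readings[c][a][b]
         for c, a, b in product(range(n_c), range(n_a), range(n_b))}
correction = sum(cells.values()) ** 2 / N          # [S(X)]^2 / N


def sos_of_totals(keep):
    """S.o.s. of the totals formed by summing over every class not in `keep`
    (positions: 0 = A, 1 = B, 2 = C)."""
    totals = {}
    for key, x in cells.items():
        group = tuple(key[i] for i in keep)
        totals[group] = totals.get(group, 0) + x
    readings_per_total = N // len(totals)
    return sum(t * t for t in totals.values()) / readings_per_total - correction


total = sos_of_totals((0, 1, 2))                   # every single reading
sos = {"A": sos_of_totals((0,)), "B": sos_of_totals((1,)), "C": sos_of_totals((2,))}
sos["AB"] = sos_of_totals((0, 1)) - sos["A"] - sos["B"]
sos["AC"] = sos_of_totals((0, 2)) - sos["A"] - sos["C"]
sos["BC"] = sos_of_totals((1, 2)) - sos["B"] - sos["C"]
sos["Remainder"] = total - sum(sos.values())       # second-order interaction

df = {"A": 2, "B": 2, "C": 5, "AB": 4, "AC": 10, "BC": 10, "Remainder": 20}
ms_remainder = sos["Remainder"] / df["Remainder"]
for name in df:
    ms = sos[name] / df[name]
    print(f"{name:10s} {sos[name]:9.3f} {df[name]:3d} {ms:8.2f} {ms / ms_remainder:7.2f}")
print(f"{'Total':10s} {total:9.3f} {N - 1:3d}")
```

Because the routine works on the raw readings, the working mean of 30 is not needed; it serves only to lighten hand arithmetic.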




11.vi. Interpretation of analysis. The most striking feature
of this analysis is the very large variance ratios for speeds
and illuminations. The data obviously depart significantly
from homogeneity; reference to tables is unnecessary with
ratios of this size. The experiment shows that both speed
and illumination have a very marked effect on performance,
especially speed. Differences between subjects are also
significant but are of no particular interest. Reference to Fisher and
Yates’s table shows that the interaction AB is just significant
at the 5 % level, but neither of the other first-order
interactions is significant. The meaning of this significant
interaction AB is that the effects of speed and illumination are not
entirely independent.

Before, however, we accept this interaction as important
we may apply a further test to it. Since the interactions AC
and BC are not significant they may be combined with the
‘remainder’ to give us a larger number of d.f. for estimating
the parent variance. This is done by adding the s.o.s. for the
interactions AC and BC and remainder, and also their d.f.,
which gives us a new remainder variance, estimated with
40 d.f., which may be used as a denominator for the V.R.’s.
In general, the more degrees of freedom there are available
for estimating a variance, the more reliable is the estimate.
The new analysis table produced by this means is given
below.

                               s.o.s.     D.F.   Mean squares   V.R.
Between illuminations (A)      765.593      2       382.80      130.7
Between speeds (B)            2369.370      2      1184.68      404.4
Between subjects (C)           118.148      5        23.63        8.1
Interaction AB                  30.519      4         7.63        2.60
Remainder                      117.185     40         2.93         —
Total                         3400.815     53
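The pooling step amounts to adding sums of squares and degrees of freedom before dividing. A minimal sketch, using the figures of the first analysis table (variable names are illustrative):

```python
# Pool the non-significant interactions AC and BC with the remainder.
sos = {"AC": 53.074, "BC": 13.297, "Remainder": 50.814}
df = {"AC": 10, "BC": 10, "Remainder": 20}

pooled_sos = sum(sos.values())
pooled_df = sum(df.values())
pooled_ms = pooled_sos / pooled_df        # new remainder variance

ms_ab = 30.519 / 4                        # mean square for interaction AB
vr_ab = ms_ab / pooled_ms                 # to be compared with the 5 % point

print(f"pooled s.o.s. = {pooled_sos:.3f} on {pooled_df} d.f., "
      f"mean square = {pooled_ms:.2f}, V.R. for AB = {vr_ab:.2f}")
```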




Consulting Fisher and Yates’s Table V with $n_1 = 4$ and
$n_2 = 40$, we find a value for the V.R. of 2.61 for P = 0.05. This
means that now we have employed a better estimate of the
parent variance based on a larger number of d.f., the
interaction AB is just on the borderline of significance and so
cannot be regarded as having an important effect on
performance. The final conclusions from the analysis would therefore
be that speed has a very marked effect on performance and
intensity of illumination also has a marked effect, though not
so marked as that of speed, within the range of this particular
experiment. The practical value of an experiment such as
this is to show that in a human activity requiring reactions
of the sort exemplified it is most important to control the
speed factor for efficient performance, and also, though in a
less degree, to control the factor of illumination. The optimum
values of speed and illumination necessary for maximum
efficiency cannot be deduced from the above analysis and
would have to be the subject of further research.

11.vii. n-way classification. Analysis on precisely
similar lines to those exemplified above may be made with
data which may be classified in 4, 5 or more ways. The amount
of arithmetic becomes progressively greater, however. For
instance, with a 4-way classification, A, B, C and D, there will
be 6 first-order interactions, AB, AC, AD, BC, BD and CD,
and 4 second-order interactions, ABC, ABD, ACD and BCD, and
the remainder will be a third-order interaction, ABCD.
Similarly, with a 5-way classification there will be 10 first-order,
10 second-order and 5 third-order interactions, and the
remainder will be a fourth-order interaction.
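The numbers of interactions of each order are simply numbers of combinations: an interaction of order k involves k + 1 of the n classes. A short sketch (the function name is illustrative) makes the counting explicit:

```python
from math import comb


def interaction_counts(n_classes: int) -> dict:
    """Number of interactions of each order in an n-way classification.
    An interaction of order k involves k + 1 of the classes; the
    highest-order interaction serves as the remainder."""
    return {order: comb(n_classes, order + 1) for order in range(1, n_classes)}


for n in (3, 4, 5):
    print(n, "classes:", interaction_counts(n))
```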



Although the analysis of variance in these cases presents no
added difficulties, apart from the greatly increased
arithmetical labour, in practice, particularly in the psychological
field, the design of an experiment becomes increasingly
difficult. If the main effects are to be directly comparable, it
is essential that each reading shall be capable of being classified
under each separate class heading. This increases the number
of tests each subject has to perform and may introduce
practice effects and fatigue effects which it may not be possible
to isolate.

In the analysis of the previous example the implicit
assumption was made that the task required of the subjects was so
simple that neither practice nor fatigue would affect
performance. This is by no means always the case and very great
care has to be exercised in designing psychological experiments
where variables such as these may enter. Such devices as
taking the order in which the tests are done as one of the
classes of variates, or taking matched groups of subjects, each
of whom does only a fraction of the total tests, are useful on
occasion. However, the data in such cases need handling
with great care and the elementary student is strongly advised
to seek expert help both in designing an experiment so that it
will yield data of an appropriate type and also in analysing
the results. The normal tests of homogeneity are invalidated
if the assumptions, previously mentioned, on which such tests
are based are not warranted. It is true that the analysis of
variance may be carried out where the data are not normally
distributed, where numbers in sub-groups are unequal and
even when parts of the data are missing, but homogeneity
tests in such cases are complicated and laborious and are in
any case beyond the bounds of this simple introduction.

EXERCISE ON CHAPTER XI

38. Re-work and check the arithmetic of Examples 43, 44 and 45.



Answers to Exercises 


Test 

1-25 

26-50 

51-75 

76-100 

A 

43-20 

44*68 

44-52 

44*28 

B 

176-76 

176*72 

193-28 

193-72 

G 

28-70 

22-92 

23-12 

20*64 

D 

28-64 

33-56 

29-96 

30-00 

E 

125-00 

120-80 

l?4-40 

130*32 

F 

27-48 

27-20 

26-68 

27-72 

Q 

35-08 

38-40 

36-16 

37-52 

{a) 27-34. 

(6) 27*20. 



€ 

Test 

(a) 

(6) 

Discrepancy 


A 

44-14 

44*17 

0-03 


B 

185-40 

184-87 

0-53 


0 

23-76 

23*86 

0-08 


D 

30-38 

30-64 

0-16 


E 

126-20 

125*13 

0-07 


F 

27-18 

27-27 

0-09 


0 

36-76 

36*79 

0-04 


Test 

(a) 

(6) 

1 Test (a) 

(b) 

A 

46-5 

44-0 

E 

123*0 

130-0 

B 

167-0 

186-5 

F 

28-0 

25-5 

C 

26-6 

22-0 

0 

37-6 

39-0 

D 

30-5 

27-6 




Test 

(a) 

(b) 

1 Test (a) 

(b) 

A 

48-62 

43-20 

E 

123-20 

136-28 

B 

148-52 

169-60 

F 

29-32 

22-10 

0 

24-82 

22-24 

0 

39*02 

43-32 

D 

29-30 

22-64 





Lower 

Upper 

Semi-inter- 


Test 

quartile 

quartile 

quartile range 


A 

38-0 

60-0 

6-0 


B 

167-0 

203-0 

23-0 


C 

16-0 

31-0 

8-0 


D 

19-5 

40-0 

10-26 


E 

107-5 

142-0 

17*25 


F 

23-0 

32-0 

4*6 


0 

^9-0 

42-0 

, 6-6 






7 . Test 

(a) 

(h) 

Test 

(a) 

(b) 

A 

7-66 

9-76 

E 

17-34 

17-32 

B 

24-08 

• 29-98 

F 

4-74 

6-62 

c , 

10-34 

8-80 

G 

7-18 

6-88 

D 

9-90 . 

1*2-60 




8 . (a) 20-67. 


(6) 16-06. 




9. Test 

(a) 

(h) 


(c) 

(d) 

A 

11-06 

8-18 


13-77 

10-81 

G 

12-87 

11-09 


11-93 

10-06 

D 

12-03 

* 12-33 


14-78 

15-05 

F 

6-99 

6-41 


6-90 

7-36 

0 

9-36 

8-43 


8-17 

8-68 

10 . Test 

S.D. 

V. 

Test 

S.D. 

V. 

A 

11-24 

26-46 

E 

21-13 

16-88 

B 

37-47 

20-21 

F 

6-46 

23-77 

C 

11-80 

49-62 

0 

8-70 

23-67 

D 

13-76 

46-29 





11. = 0-000002; /^j = 3-ll, 

7 i = 0-0014; S.e. = 0-245, 
yj = 0-ll; S.e. = 0-490. 

Hence the distribution does not depart significantly from normality. 

12. /5i = 0-0526; = 2*44, 

71 = 0-23; S.e. = 0-245, 

7j=- 0-56; S.e. = 0-490. 

Hence the distribution does not depart significantly from normality. 

13. 0. 

14. \P = 0-0668. There is therefore no evidence that people were 
not guessing. 


16. Mean difference = 0-12, t = 0-037. 

Hence the mean difference does not depart significantly from zero. 


16. 

Standard 



Standard 



error 

Difference 


error 

Difference 

• 

(o) 

1-421 

7-39 

id) 

1-620 

3-20 

(6) 

1-813 

6-60 

(e) 

1-628 

6-37 

(c) 

1-346 

3-40 

(/) 

1-084 

9-67 

All these differences are significant.

17. (a) t = 0.62; difference not significant.
    (b) t = 1.70; difference not significant.
    (c) t = 1.40; difference not significant.
    (d) t = 1.29; difference not significant.
    (e) t = 1.94; difference not significant.
    (f) t = 4.42; difference significant.
    (g) t = 3.63; difference significant.
    (h) t = 4.25; difference significant.

18. V.R. = 3.80. This is less than the value of 4.21 approx. given
in Fisher and Yates, hence the conclusion is not invalidated.

19. Diff. between proportions = 0.19. S.e. of diff. = 0.0482. Diff.
is significant, i.e. influenza is definitely more prevalent in the first
town.


20. (a) r = 0*216. (c) r = 0*088. 

(6) r = 0*246. (d) r = 0*266. 


22.
           A        B        C        D        E        F        G
     A     —      0.372   -0.268   -0.262   -0.075   -0.395   -0.168
     B   0.372     —      -0.170   -0.068    0.106   -0.057   -0.076
     C  -0.268  -0.170      —       0.151    0.236    0.309    0.161
     D  -0.262  -0.068    0.151      —       0.032    0.388    0.106
     E  -0.075   0.106    0.236    0.032      —       0.338    0.252
     F  -0.395  -0.057    0.309    0.388    0.338      —       0.289
     G  -0.168  -0.076    0.161    0.106    0.252    0.289      —

This table illustrates a method of tabulating the coefficients of
correlation between each pair of a set of tests. For example, the coefficient
of correlation between test C and test F is given in the row headed C
under the column headed F, or in the row headed F under the column
headed C, and is seen to be 0.309.

23. Correlation between F and A; S.e. = 0.0844,
         „        „     F and B; S.e. = 0.0997,
         „        „     F and C; S.e. = 0.0905,
         „        „     F and D; S.e. = 0.0849,
         „        „     F and E; S.e. = 0.0886,
         „        „     F and G; S.e. = 0.0916.

The coefficient differs significantly from zero in every case except that
between F and B.

24. Tjgg is significantly different from but not from any of the 
others specified. 

2^5. = 0*363, 

^0F,D ~ 0*276, 

''^FQ.E “ 0*224, 

, ^SF.Q ~ 0*286, 

, '^CE.B = 0*269. 




26. (a) Between H and G (1-26), p = 0*026, 

H and D (1-26), p = 0*141. 

(6) Between G and D (1-26), p = 0*171, 

F^om Exercise 20 Ja), r = 0*215. 

27. T = 0*440. 

28. Between H and G (1-25), r = —0*003, 

H and jD (1-25), T= 0*128, 

CandZ) (1-26), r= 0117. 

29. (a) For the regression of G on F, z = 0.3387; hence the
regression is not linear. (Alternatively, the variance ratio = 1.97. That
in the Fisher and Yates’s 5 % point table for $n_1 = 12$ and $n_2 = 60$ is
1.92; hence the regression departs from linearity.)

For the regression of F on G, the variance due to deviations from
linearity is less than that within arrays; hence the regression is linear.

(6) = 0*630, Tf^y = 0*394. 


30. Between      χ²        C        P
    A and I     6.821    0.253    0.339
    B and I     3.392    0.181    0.757
    C and I     8.987    0.287    0.174
    D and I    13.835    0.349    0.032      Significant
    E and I    16.432    0.376    0.012      Significant
    F and I    50.086    0.578    0.000001   Significant
    G and I    16.352    0.375    0.012      Significant

31. Between      χ²        P
    A and I     0.62     0.431
    B and I     0.47     0.493
    C and I     2.52     0.112
    D and I     5.70     0.017       Significant
    E and I    10.38     0.0013      Significant
    F and I    38.30     0.0000001   Significant
    G and I     8.20     0.004       Significant
    H and I     5.77     0.016       Significant

• 

32. χ² = 11.58. P = 0.009, hence the association is significant.

33. χ² = 160.02. For this P is exceedingly small. Hence the
hypothesis fits the observed data very badly and we conclude that the
observer is unreliable. Note that he has an undue tendency to record
readings ending in 0 or 5.

34. y = 3.12x + 1.686. Mean square for linear regression is 408.845,
that for departures from linear regression is 0.265, hence the fit is
exceedingly good.



35. (a) d = 6. The triads are ABEA, ACEA, ADEA, BCFB,
        BDFB, BEFB.

    (b) ζ = 0.571.

    (c) Order is G A BCDE F.

    (d) The subject is inconsistent in his judgments. This may be
        because his ability is poor or because the task is too
        difficult. This could be decided only by further
        experiments.

36. (a) u = 0*666. — 1> = 7*06, hence u is significant. 

    (b) We should conclude that G, A and F are quite distinct in
        the quality judged, but that there is probably no real
        difference between B, C, D and E.

37. W = 0.863. z = 1.4284, hence W is significant.



APPENDIX A

Table of $\sqrt{n_1 n_2 (n_1 + n_2 - 2)/(n_1 + n_2)}$



10 

11 

u 

13 

14 

16 

16 

17 

18 

19 

10 

9-49» 

9-98 

10-44 

l(k89 

11-33 

11-76 

12-16 

12-56 

12-93 

13-30 

11 

9*98 

10-49 

10-98 

11-46 

11-90 

12-34 

12-77 

13-18 

13-58 

13-97 

12 

10-44 

10-98 

11-49 

11-98 

12-45 

12-91 

13-36 

13-78 

14-20 

14-60 

13 

10-89 

11-46 

11-98 

12-49 

12-98 

13-46 

13-92 

14-36 

14-80 

15-22 

14 

11-33 

11-90 

12-46 

12-98 

13-49 

13-98 

14-46 

14-92 

16-37 

16-81 

16 

11-76 

12-34 

12-91 

13-46 

13-98 

14-49 

14-98 

16-46 

16-93 

16-38 

16 

12-16 

12-77 

13-36 

13-92 

14-46 

14-98 

16-49 

15-98 

16-46 

16-93 

17 

12-66 

13-18 

13-7a 

14-36 

14-92 

16-46 

16-98 

16-49 

16-99 

17-47 

18 

12-93 

13-58 

14-20 

14-80 

16-37 

1 16-93 

16-46 

16-99 

17-49 

17*99 

19 

13-30 

13-97 

14-60 

16-22 

16-81 

16-38 

16-93 

17-47 

17-99 

18-49 

20 

13-66 

14-35 

16-00 

16-63 

16-23 

16-82 

17-38 

17-93 

18-47 

18-99 

21 

lf02 

14-72 

15-39 

16-03 

16-65 

17-26 

17-83 

18-39 

18-94 

19-47 

22 

14-36 

16-08 

15-76 

16-42 

17-06 

17-67 

18-26 

18-84 

19-40 

19-94 

23 

14-70 

16-43 

16-13 

16-80 

17-45 

18-08 

18-69 

19-27 

19-84 

20-40 

24 

16-03 

16-78 

16-49 

17-18 

17-84 

18-48 

19-10 

19-70 

20-28 

20-86 

25 

16-36 

16-12 

16-86 

17-65 

18-22 

18-87 

19-61 

20-12 

20-71 

21-29 

26 

16-67 

16-46 

17-19 

17-91 

18-60 

19-26 

19-90 

20-53 

21-14 

21-73 

27 

16-98 

16-77 

17-53 

18-26 

18-96 

19-64 

20-30 

20-93 

21-66 

22-15 

28  16.29  17.09  17.87  18.61  19.32  20.01  20.68  21.33  21.96  22.57
29  16.59  17.41  18.19  18.95  19.68  20.38  21.06  21.72  22.36  22.98
30  16.88  17.72  18.52  19.28  20.02  20.74  21.43  22.10  22.75  23.38
31  17.17  18.02  18.83  19.66  20.36  21.09  21.79  22.47  23.13  23.78
32  17.46  18.32  19.15  19.94  20.70  21.44  22.15  22.84  23.52  24.17
33  17.74  18.61  19.45  20.26  21.03  21.78  22.50  23.21  23.89  24.55
34  18.02  18.90  19.76  20.57  21.37  22.12  22.85  23.57  24.26  24.93
35  18.29  19.19  20.05  20.88  21.68  22.45  23.20  23.92  24.62  25.31
36  18.56  19.47  20.35  21.19  22.00  22.78  23.53  24.27  24.98  25.67
37  18.82  19.75  20.64  21.49  22.31  23.10  23.87  24.61  25.33  26.04
38  19.08  20.02  20.92  21.79  22.62  23.42  24.20  24.95  25.68  26.39
39  19.34  20.29  21.20  22.08  22.92  23.73  24.52  25.28  26.03  26.75
40  19.60  20.56  21.48  22.37  23.22  24.05  24.84  25.62  26.37  27.10
41  19.85  20.82  21.76  22.66  23.52  24.35  25.16  25.94  26.70  27.44
42  20.10  21.08  22.03  22.94  23.81  24.66  25.47  26.26  27.03  27.78
43  20.34  21.34  22.30  23.22  24.10  24.96  25.78  26.58  27.36  28.12
44  20.58  21.60  22.56  23.49  24.39  25.25  26.09  26.90  27.68  28.45
45  20.82  21.85  22.83  23.77  24.67  25.54  26.39  27.21  28.01  28.78
46  21.06  22.10  23.08  24.04  24.95  25.83  26.69  27.52  28.32  29.11
47  21.30  22.34  23.34  24.30  25.23  26.12  26.98  27.82  28.63  29.43
48  21.53  22.59  23.60  24.57  25.50  26.40  27.28  28.12  28.94  29.75
49  21.76  22.83  23.85  24.83  25.77  26.68  27.57  28.42  29.25  30.06
50  21.98  23.06  24.10  25.09  26.04  26.96  27.85  28.72  29.56





APPENDIX A (cont.)

       20     21     22     23     24     25     26     27     28     29

10  13.66  14.02  14.36  14.70  15.03  15.35  15.67  15.98  16.29  16.59
11  14.35  14.72  15.08  15.43  15.78  16.12  16.45  16.77  17.09  17.41
12  15.00  15.39  15.76  16.13  16.49  16.85  17.19  17.53  17.87  18.19
13  15.63  16.03  16.42  16.80  17.18  17.55  17.91  18.26  18.61  18.95
14  16.23  16.65  17.06  17.45  17.84  18.22  18.60  18.96  19.32  19.68
15  16.82  17.25  17.67  18.08  18.48  18.87  19.26  19.64  20.01  20.38
16  17.38  17.83  18.26  18.69  19.10  19.51  19.90  20.30  20.68  21.06
17  17.93  18.39  18.84  19.27  19.70  20.12  20.53  20.93  21.33  21.72
18  18.47  18.94  19.40  19.84  20.28  20.71  21.14  21.55  21.96  22.36
19  18.99  19.47  19.94  20.40  20.85  21.29  21.73  22.15  22.57  22.98
20  19.49  19.99  20.47  20.94  21.41  21.86  22.30  22.74  23.17  23.59
21  19.99  20.49  20.99  21.47  21.95  22.41  22.86  23.31  23.75  24.18
22  20.47  20.99  21.49  21.99  22.47  22.95  23.41  23.87  24.32  24.76
23  20.94  21.47  21.99  22.49  22.99  23.47  23.95  24.42  24.87  25.32
24  21.41  21.95  22.47  22.99  23.49  23.99  24.48  24.95  25.42  25.88
25  21.86  22.41  22.95  23.47  23.99  24.49  24.99  25.48  25.95  26.42
26  22.30  22.86  23.41  23.95  24.48  24.99  25.50  25.99  26.48  26.96
27  22.74  23.31  23.87  24.42  24.95  25.48  25.99  26.50  26.99  27.48
28  23.17  23.75  24.32  24.87  25.42  25.95  26.48  26.99  27.50  27.99
29  23.59  24.18  24.76  25.32  25.88  26.42  26.96  27.48  27.99  28.50
30  24.00  24.60  25.19  25.77  26.33  26.88  27.43  27.96  28.48  28.99
31  24.41  25.02  25.62  26.20  26.78  27.34  27.89  28.43  28.96  29.48
32  24.81  25.43  26.04  26.63  27.21  27.78  28.34  28.89  29.43  29.96
33  25.20  25.83  26.45  27.05  27.64  28.22  28.79  29.35  29.89  30.43
34  25.59  26.23  26.86  27.47  28.07  28.66  29.23  29.80  30.35  30.90
35  25.97  26.62  27.26  27.88  28.49  29.08  29.67  30.24  30.80  31.36
36  26.35  27.01  27.65  28.28  28.90  29.50  30.10  30.68  31.25  31.81
37  26.72  27.39  28.04  28.68  29.31  29.92  30.52  31.11  31.69  32.26
38  27.09  27.77  28.43  29.07  29.71  30.33  30.94  31.53  32.12  32.70
39  27.45  28.14  28.81  29.46  30.10  30.73  31.35  31.95  32.55  33.13
40  27.81  28.50  29.18  29.85  30.50  31.13  31.76  32.37  32.97  33.56
41  28.16  28.87  29.55  30.22  30.88  31.53  32.16  32.78  33.39  33.99
42  28.51  29.22  29.92  30.60  31.26  31.92  32.56  33.18  33.80  34.40
43  28.86  29.58  30.28  30.97  31.64  32.30  32.95  33.58  34.21  34.82
44  29.20  29.93  30.64  31.33  32.01  32.68  33.34  33.98  34.61  35.23
45  29.53  30.27  30.99  31.69  32.38  33.06  33.72  34.37  35.01  35.63
46  29.87  30.61  31.34  32.05  32.75  33.43  34.10  34.76  35.40  36.03
47  30.20  30.95  31.69  32.41  33.11  33.80  34.47  35.14  35.79  36.43
48  30.52  31.29  32.03  32.76  33.47  34.16  34.85  35.52  36.17  36.82
49  30.85  31.62  32.37  33.10  33.82  34.52  35.21  35.89  36.56  37.21
50  31.17  31.94  32.70  33.44  34.17  34.88  35.58  36.26  36.93  37.59




APPENDIX A (cont.)





       30     31     32     33     34     35     36     37     38     39

10  16.88  17.17  17.46  17.74  18.02  18.29  18.56  18.82  19.08  19.34
11  17.72  18.02  18.32  18.61  18.90  19.19  19.47  19.75  20.02  20.29
12  18.52  18.83  19.15  19.45  19.76  20.05  20.35  20.64  20.92  21.20
13  19.28  19.66  19.94  20.26  20.57  20.88  21.19  21.49  21.79  22.08
14  20.02  20.36  20.70  21.03  21.37  21.68  22.00  22.31  22.62  22.92
15  20.74  21.09  21.44  21.78  22.12  22.45  22.78  23.10  23.42  23.73
16  21.43  21.79  22.15  22.50  22.85  23.20  23.53  23.87  24.20  24.52
17  22.10  22.47  22.84  23.21  23.57  23.92  24.27  24.61  24.95  25.28
18  22.75  23.13  23.52  23.89  24.26  24.62  24.98  25.33  25.68  26.03
19  23.38  23.78  24.17  24.55  24.93  25.31  25.67  26.04  26.39  26.75
20  24.00  24.41  24.81  25.20  25.59  25.97  26.35  26.72  27.09  27.45
21  24.60  25.02  25.43  25.83  26.23  26.62  27.01  27.39  27.77  28.14
22  25.19  25.62  26.04  26.45  26.86  27.26  27.65  28.04  28.43  28.81
23  25.77  26.20  26.63  27.05  27.47  27.88  28.28  28.68  29.07  29.46
24  26.33  26.78  27.21  27.64  28.07  28.49  28.90  29.31  29.71  30.10
25  26.88  27.34  27.78  28.22  28.66  29.08  29.50  29.92  30.33  30.73
26  27.43  27.89  28.34  28.79  29.23  29.67  30.10  30.52  30.94  31.35
27  27.96  28.43  28.89  29.35  29.80  30.24  30.68  31.11  31.53  31.95
28  28.48  28.96  29.43  29.89  30.35  30.80  31.25  31.69  32.12  32.55
29  28.99  29.48  29.96  30.43  30.90  31.36  31.81  32.26  32.70  33.13
30  29.50  29.99  30.48  30.96  31.43  31.90  32.36  32.82  33.26  33.71
31  29.99  30.50  30.99  31.48  31.96  32.44  32.90  33.37  33.82  34.27
32  30.48  30.99  31.50  31.99  32.48  32.96  33.44  33.91  34.37  34.83
33  30.96  31.48  31.99  32.50  32.99  33.48  33.96  34.44  34.91  35.37
34  31.43  31.96  32.48  32.99  33.50  33.99  34.48  34.97  35.44  35.91
35  31.90  32.44  32.96  33.48  33.99  34.50  34.99  35.48  35.97  36.44
36  32.36  32.90  33.44  33.96  34.48  34.99  35.50  35.99  36.48  36.97
37  32.82  33.37  33.91  34.44  34.97  35.48  35.99  36.50  36.99  37.48
38  33.26  33.82  34.37  34.91  35.44  35.97  36.48  36.99  37.50  37.99
39  33.71  34.27  34.83  35.37  35.91  36.44  36.97  37.48  37.99  38.50
40  34.14  34.71  35.28  35.83  36.38  36.91  37.44  37.97  38.48  38.99
41  34.57  35.15  35.72  36.28  36.84  37.38  37.92  38.45  38.97  39.48
42  35.00  35.59  36.16  36.73  37.29  37.84  38.38  38.92  39.45  39.97
43  35.42  36.01  36.60  37.17  37.74  38.29  38.84  39.39  39.92  40.45
44  35.84  36.44  37.03  37.61  38.18  38.74  39.30  39.85  40.39  40.92
45  36.25  36.86  37.45  38.04  38.62  39.19  39.75  40.30  40.85  41.39
46  36.66  37.27  37.87  38.47  39.05  39.63  40.19  40.76  41.31  41.85
47  37.06  37.68  38.29  38.89  39.48  40.06  40.64  41.20  41.76  42.31
48  37.46  38.08  38.70  39.31  39.90  40.49  41.07  41.64  42.21  42.77
49  37.85  38.48  39.11  39.72  40.32  40.92  41.50  42.08  42.65  43.22
50  38.24  38.88  39.51  40.13  40.74  41.34  41.93  42.51  43.09  43.66





APPENDIX A (cont.) 



       40     41     42     43     44     45     46     47     48     49

10  19.60  19.85  20.10  20.34  20.58  20.82  21.06  21.30  21.53  21.76
11  20.56  20.82  21.08  21.34  21.60  21.85  22.10  22.34  22.59  22.83
12  21.48  21.76  22.03  22.30  22.56  22.83  23.08  23.34  23.60  23.85
13  22.37  22.66  22.94  23.22  23.49  23.77  24.04  24.30  24.57  24.83
14  23.22  23.52  23.81  24.10  24.39  24.67  24.95  25.23  25.50  25.77
15  24.05  24.35  24.66  24.96  25.25  25.54  25.83  26.12  26.40  26.68
16  24.84  25.16  25.47  25.78  26.09  26.39  26.69  26.98  27.28  27.57
17  25.62  25.94  26.26  26.58  26.90  27.21  27.52  27.82  28.12  28.42
18  26.37  26.70  27.03  27.36  27.68  28.01  28.32  28.63  28.94  29.25
19  27.10  27.44  27.78  28.12  28.45  28.78  29.11  29.43  29.75  30.06
20  27.81  28.16  28.51  28.86  29.20  29.53  29.87  30.20  30.52  30.85
21  28.50  28.87  29.22  29.58  29.93  30.27  30.61  30.95  31.29  31.62
22  29.18  29.55  29.92  30.28  30.64  30.99  31.34  31.69  32.03  32.37
23  29.85  30.22  30.60  30.97  31.33  31.69  32.05  32.41  32.76  33.10
24  30.50  30.88  31.26  31.64  32.01  32.38  32.75  33.11  33.47  33.82
25  31.13  31.53  31.92  32.30  32.68  33.06  33.43  33.80  34.16  34.52
26  31.76  32.16  32.56  32.95  33.34  33.72  34.10  34.47  34.85  35.21
27  32.37  32.78  33.18  33.58  33.98  34.37  34.76  35.14  35.52  35.89
28  32.97  33.39  33.80  34.21  34.61  35.01  35.40  35.79  36.17  36.56
29  33.56  33.99  34.40  34.82  35.23  35.63  36.03  36.43  36.82  37.21
30  34.14  34.57  35.00  35.42  35.84  36.25  36.66  37.06  37.46  37.85
31  34.71  35.15  35.59  36.01  36.44  36.86  37.27  37.68  38.08  38.48
32  35.28  35.72  36.16  36.60  37.03  37.45  37.87  38.29  38.70  39.11
33  35.83  36.28  36.73  37.17  37.61  38.04  38.47  38.89  39.31  39.72
34  36.38  36.84  37.29  37.74  38.18  38.62  39.05  39.48  39.90  40.32
35  36.91  37.38  37.84  38.29  38.74  39.19  39.63  40.06  40.49  40.92
36  37.44  37.92  38.38  38.84  39.30  39.75  40.19  40.64  41.07  41.50
37  37.97  38.45  38.92  39.39  39.85  40.30  40.76  41.20  41.64  42.08
38  38.48  38.97  39.45  39.92  40.39  40.85  41.31  41.76  42.21  42.65
39  38.99  39.48  39.97  40.45  40.92  41.39  41.85  42.31  42.77  43.22
40  39.50  39.99  40.48  40.97  41.45  41.92  42.39  42.86  43.32  43.77
41  39.99  40.50  40.99  41.49  41.97  42.45  42.92  43.40  43.86  44.32
42  40.48  40.99  41.50  41.99  42.49  42.97  43.45  43.92  44.40  44.86
43  40.97  41.49  41.99  42.50  42.99  43.49  43.97  44.45  44.93  45.40
44  41.45  41.97  42.49  42.99  43.50  43.99  44.49  44.97  45.45  45.93
45  41.92  42.45  42.97  43.49  43.99  44.50  44.99  45.49  45.97  46.46
46  42.39  42.92  43.45  43.97  44.49  44.99  45.50  45.99  46.49  46.97
47  42.86  43.40  43.92  44.45  44.97  45.49  45.99  46.50  46.99  47.49
48  43.32  43.86  44.40  44.93  45.45  45.97  46.49  46.99  47.50  47.99
49  43.77  44.32  44.86  45.40  45.93  46.46  46.97  47.49  47.99  48.50
50  44.42  44.78  45.32  45.87  46.40  46.93  47.46  47.97  48.49  48.99



APPENDIX B 




Curve showing values of N necessary for 



Conversely, if the number of observations in a sample yielding
a particular value of r is larger than that given by the curve
corresponding to that value of r, then r is significant.

In borderline cases the standard error of r should always be
calculated as a check.







APPENDIX C 

Table of values of 6/N(N² − 1) for different values of N


The required values are the reciprocals of the numbers below. 


 N   N(N² − 1)/6      N   N(N² − 1)/6      N   N(N² − 1)/6

10        165        30       4495        50      20825
11        220        31       4960        51      22100
12        286        32       5456        52      23426
13        364        33       5984        53      24804
14        455        34       6545        54      26235
15        560        35       7140        55      27720
16        680        36       7770        56      29260
17        816        37       8436        57      30856
18        969        38       9139        58      32509
19       1140        39       9880        59      34220
20       1330        40      10660        60      35990
21       1540        41      11480        61      37820
22       1771        42      12341        62      39711
23       2024        43      13244        63      41664
24       2300        44      14190        64      43680
25       2600        45      15180        65      45760
26       2925        46      16215        66      47905
27       3276        47      17296        67      50116
28       3654        48      18424        68      52394
29       4060        49      19600        69      54740


To obtain ρ by the use of the above table, divide S(d²) by the
number in the table opposite the appropriate value of N and subtract
the answer from 1.
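
The rule above can be sketched in code. The following is a minimal Python illustration, assuming two equal-length lists of ranks with no ties; the function names `table_value` and `rank_correlation` and the example ranks are hypothetical and do not come from the original text.

```python
def table_value(n):
    """Number tabulated opposite N in the table above: N(N^2 - 1)/6."""
    # N(N^2 - 1) = (N - 1) * N * (N + 1) is always divisible by 6.
    return n * (n * n - 1) // 6


def rank_correlation(ranks_x, ranks_y):
    """Divide S(d^2) by the tabulated number and subtract the answer from 1."""
    if len(ranks_x) != len(ranks_y):
        raise ValueError("both rankings must have the same length")
    n = len(ranks_x)
    sum_d_squared = sum((x - y) ** 2 for x, y in zip(ranks_x, ranks_y))
    return 1 - sum_d_squared / table_value(n)


# Illustrative data only: ten subjects ranked in exactly opposite orders.
order = list(range(1, 11))
print(table_value(10))                       # 165
print(rank_correlation(order, order[::-1]))  # -1.0
```

Computing the divisor directly also allows values of N outside the tabulated range of 10 to 69.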





APPENDIX D 

Squares of numbers from 1 to 99 


No.  Square     No.  Square     No.  Square     No.  Square

  1       1      26     676      51    2601      76    5776
  2       4      27     729      52    2704      77    5929
  3       9      28     784      53    2809      78    6084
  4      16      29     841      54    2916      79    6241
  5      25      30     900      55    3025      80    6400
  6      36      31     961      56    3136      81    6561
  7      49      32    1024      57    3249      82    6724
  8      64      33    1089      58    3364      83    6889
  9      81      34    1156      59    3481      84    7056
 10     100      35    1225      60    3600      85    7225
 11     121      36    1296      61    3721      86    7396
 12     144      37    1369      62    3844      87    7569
 13     169      38    1444      63    3969      88    7744
 14     196      39    1521      64    4096      89    7921
 15     225      40    1600      65    4225      90    8100
 16     256      41    1681      66    4356      91    8281
 17     289      42    1764      67    4489      92    8464
 18     324      43    1849      68    4624      93    8649
 19     361      44    1936      69    4761      94    8836
 20     400      45    2025      70    4900      95    9025
 21     441      46    2116      71    5041      96    9216
 22     484      47    2209      72    5184      97    9409
 23     529      48    2304      73    5329      98    9604
 24     576      49    2401      74    5476      99    9801
 25     625      50    2500      75    5625





APPENDIX E 

Numerical material for exercises

The following two pages of figures give the scores of 100 subjects
in certain psychological tests. The scores in seven such tests are given
in columns A to G inclusive. Column H gives the ranked order of
merit of the subjects in a scholastic examination, and column I gives
an assessment of the practical ability of the subjects in some
handicraft. This assessment is in four categories: Very Good (V.G.),
Good (G.), Fair (F.) and Poor (P.).

These two pages of scores provide the basis for most of the exercises
at the end of the successive chapters of the text. The student who
wishes for additional exercises may easily make up others for himself
from the figures given.





Subject    A     B    C    D     E    F    G      H   I

      1   41   164   31   46   121   31   29     60   V.G.
      2   49   168   13   30   146   31   36     12   G.
      3   41   132   20    6   108   26   26    73½   F.
      4   36   166   31   26   141   30   40    28½   G.
      5   36   162   30   27    96   18   28     92   P.
      6   66   166   46   33   164   36   49      4   V.G.
      7   60   163   38   41   102   27   40     94   F.
      8   38   166   42   62   156   31   26     11   G.
      9   49   161   30   24   147   28   39     30   F.
     10   46   166   30   37   124   27   68     62   F.
     11   28   167   30   23    98   19   33     93   F.
     12   64   204        16    92   19   18     96   P.
     13   48   203   36   35   170   36   41      1   G.
     14   31   139    2   38   126   29   36     62   G.
     15   46         41   19   127   28   38     52   F.
     16   49   166   44   11   131   22   33     36   F.
     17   66   240   18   20   111   17   36     60   P.
     18   46   148   26   40   105   26   21     76   P.
     19   21   138   44   31   122   39   27     64   V.G.
     20   27   180   30   30   100   33   60     84   G.
     21   60   277   19   27   152   34   39     18   G.
     22   49   174    6   16   125   25   27     55   F.
     23   26   176   44   38   116   34   40    73½   G.
     24   39   167   16   29   125   19   46     66   G.
     25   62   274    9   15   131   24   25     36   P.
     26   48   167   13   10    94   22   47     95   F.
     27   44   152    7   42    80   30   30    100   V.G.
     28   60   187   26   36   109   24   41     87   F.
     29   49   193   12   32   114   31   23     70   P.
     30   46   196   30   40   168   26   38          G.
     31   61   184    6   10    94   25   26     97   F.
     32   49   222   22   28   137   27   40     31   G.
     33   38   196   24   63   127   36   34    67½   V.G.
     34   41   188   31   36   137   30   37     37   G.
     35   60   234   15   28   107   36   45     72   V.G.
     36   62   201    6   43   121   30   34    67½   F.
     37   32   143   39   37   142   32   61     33   G.
     38   36   177   32   35   141   28   40     33   F.
     39   44   166   21   30   116   19   39     76   P.
     40   61   246   20   49   121   29   42     59   V.G.
     41   39   163   42   43   106   27   27     77   F.
     42   62   181   31   43   109   29   46     71   V.G.
     43   47   160   19   20   129   27   55    38½   G.
     44   29   146   14   69   133   36   42    38½   V.G.
     45   48   129   23   26   107    8   23     82   P.
     46   44   173   28   18   149   26   64     13   P.
     47   44   164   17   30   168   13   40      6   P.
     48   29   164   63   43    90   33   37     86   G.
     49   60   177   22   16   141   32   34     33   V.G.
     50   40   140   21   36   101   26   30     86   F.




Subject    A     B    C    D     E    F    G      H   I

     51   72   262    4   54   116   17   22     61   P.
     52   71   160   22   17   128   28   44     40   V.G.
     53   51   247    4   10   130   23   40     41   F.
     54   38   205   45   34   133   29   51          V.G.
     55   58   266   22   26   126   19   26     42   P.
     56   49   210   34   25   141   23   39     19   G.
     57   59   139    5   27   123   24   39     49   F.
     58   31   162   27   38   142   24   35     21   P.
     59   40   178   17    7    84   24   33     91   P.
     60   59   149   26   25   118   24   20     63   G.
     61   34   150   27   28    90   19   27     99   F.
     62   25   216   36   34   157   39   37      9   V.G.
     63   39   195   24   12   151   22   29          P.
     64   51   216   25   57    99   25   25     88   F.
     65   50   178   27   21   146   31   30     21   G.
     66   35   283   43   31   162   35   44      2
     67   15   170   19   50   127   36   39     63   V.G.
     68   40   157   39   14   139   21   45    16½   P.
     69   24   144   12   43    82   26   43     98   F.
     70   57   236    5   16   135   21   33     21   G.
     71   48   185   19   13   124   32   39     63   F.
     72   34   146   21   56   148   37   49     15   V.G.
     73   45   169   35   35    96   29   40     89   V.G.
     74   37   182    7   25    90   26   31     90   F.
     75   51   227   33   51   123   33   44     65   V.G.
     76   23   149   11   46   130   32   46     44   V.G.
     77   45   244   14   34   113   25   39     66   F.
     78   51   228   20   15   143   32   42     23   G.
     79   41   172   48   60   127   37   40     46   G.
     80   63   216   20    9   137   16   26     24   P.
     81   39   170    7   42   107   27   39     78   G.
     82   59   234    5   22    96   25   32     79   F.
     83   45   206   31   37   147   31   40     25   G.
     84   67   180   12    9   100   12   44     82   P.
     85   42   170   22   32   137   19   65    16½   F.
     86   39   198   12   20   162   23   35      3   F.
     87   34   200   17   14   145   27   28     26   G.
     88   39   148   32   30   136   35   27     46   G.
     89   20   163   33   35   153   33   28     10   F.
     90   43   187   33   58   152   41   53     14   V.G.
     91   37   174   29   17   142   40   47     27   G.
     92   42   322   21   40   102   23   37     68   P.
     93   46   186   24   55   145   35   41    28½   G.
     94   54   214   21   19   130   30   27     47   F.
     95   40   167   27   15   116   25   45     67   P.
     96   45   159   23   25   142   37   44          V.G.
     97   52   203    7   34    96   21   21     82   F.
     98   35   197   18   20   123   22   40     48   F.
     99   47   167    7   48   115   24   29     69   G.
    100   59   189   22   14   162   21   33      6   F.





APPENDIX F


Values of n(n — 1) (2n + 5) for different values of n 


 n   n(n − 1)(2n + 5)      n   n(n − 1)(2n + 5)      n   n(n − 1)(2n + 5)

 2            18          22         22638          42        153258
 3            66          23         25806          43        164346
 4           156          24         29256          44        175956
 5           300          25         33000          45        188100
 6           510          26         37050          46        200790
 7           798          27         41418          47        214038
 8          1176          28         46116          48        227856
 9          1656          29         51156          49        242256
10          2250          30         56550          50        257250
11          2970          31         62310          51        272850
12          3828          32         68448          52        289068
13          4836          33         74976          53        305916
14          6006          34         81906          54        323406
15          7350          35         89250          55        341550
16          8880          36         97020          56        360360
17         10608          37        105228          57        379848
18         12546          38        113886          58        400026
19         14706          39        123006          59        420906
20         17100          40        132600          60        442500
21         19740          41        142680




This table may be used in calculating the significance of τ.
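
For values of n beyond the table, the tabulated quantity is easy to compute directly. The sketch below is illustrative only: the function names are hypothetical, and the second function rests on an assumption not quoted from this appendix, namely the standard result that for n untied ranks the sum S used in Kendall's rank correlation has variance n(n − 1)(2n + 5)/18 when the rankings are unrelated.

```python
import math


def table_value(n):
    """Tabulated quantity n(n - 1)(2n + 5)."""
    return n * (n - 1) * (2 * n + 5)


def s_standard_deviation(n):
    """Standard deviation of S for n untied ranks (assumed formula, see above)."""
    return math.sqrt(table_value(n) / 18)


print(table_value(10))           # 2250
print(s_standard_deviation(10))  # square root of 125
```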





APPENDIX G

Values of t(t − 1)(t − 2) for different values of t


 t   t(t − 1)(t − 2)      t   t(t − 1)(t − 2)      t   t(t − 1)(t − 2)

 3            6          19         5814          35        39270
 4           24          20         6840          36        42840
 5           60          21         7980          37        46620
 6          120          22         9240          38        50616
 7          210          23        10626          39        54834
 8          336          24        12144          40        59280
 9          504          25        13800          41        63960
10          720          26        15600          42        68880
11          990          27        17550          43        74046
12         1320          28        19656          44        79464
13         1716          29        21924          45        85140
14         2184          30        24360          46        91080
15         2730          31        26970          47        97290
16         3360          32        29760          48       103776
17         4080          33        32736          49       110544
18         4896          34        35904          50       117600


This table may be used for calculating the significance of τ when
there are tied rankings.
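
As a small illustration, the tabulated expression can be evaluated and summed over several groups of tied ranks. This is only a sketch: the names `tie_term` and `tie_total` are hypothetical, and it is an assumption that the terms are summed over the groups in this way, since the appendix does not show how they enter the significance calculation.

```python
def tie_term(t):
    """Tabulated quantity t(t - 1)(t - 2) for a group of t tied ranks."""
    return t * (t - 1) * (t - 2)


def tie_total(tie_sizes):
    """Sum of the tabulated quantity over several groups of ties."""
    return sum(tie_term(t) for t in tie_sizes)


print(tie_term(10))       # 720
print(tie_total([3, 4]))  # 6 + 24 = 30
```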



APPENDIX H 


Values of mn(m − 1)(n − 1)/4 for different values of n and m


 n    m = 2    m = 3    m = 4    m = 5    m = 6

 3       3        9       18       30       45
 4       6       18       36       60       90
 5      10       30       60      100      150
 6      15       45       90      150      225
 7      21       63      126      210      315
 8      28       84      168      280      420
 9      36      108      216      360      540
10      45      135      270      450      675
11      55      165      330      550      825
12      66      198      396      660      990
13      78      234      468      780     1170
14      91      273      546      910     1365
15     105      315      630     1050     1575


To calculate u, divide 2S by the appropriate number above and
subtract 1.
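
The rule just stated (divide 2S by the tabulated number, then subtract 1) can be sketched as follows. The function names are hypothetical and the numerical inputs are purely illustrative; the divisor is the expression mn(m − 1)(n − 1)/4 given in the heading of this appendix.

```python
def table_value(m, n):
    """Tabulated quantity mn(m - 1)(n - 1)/4."""
    return m * n * (m - 1) * (n - 1) / 4


def coefficient_from_total(total, m, n):
    """Divide twice the total by the tabulated number and subtract 1."""
    return 2 * total / table_value(m, n) - 1


print(table_value(6, 15))               # 1575.0
print(coefficient_from_total(9, 3, 3))  # 1.0
```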


APPENDIX I 



Values of ν for different values of n and m

 n    m = 3    m = 4     m = 5      m = 6

 3      18        9      6.6̇      5.625
 4      36       18     13.3̇      11.25
 5      60       30     22.2̇      18.75
 6      90       45     33.3̇     28.125
 7     126       63     46.6̇     39.375
 8     168       84     62.2̇       52.5
 9     216      108       80       67.5
10     270      135      100     84.375
11     330      165    122.2̇    103.125
12     396      198    146.6̇     123.75
13     468      234    173.3̇     146.25
14     546      273    202.2̇    170.625
15     630      315    233.3̇    196.875


This table may be used in the testing of the significance of 





APPENDIX J 

Values of n(n − 1)m(m − 1)(m − 3)/[8(m − 2)] for different values of n and m

 n    m = 4    m = 5      m = 6

 2     1.5     3.3̇      5.625
 3     4.5       10     16.875
 4       9       20      33.75
 5      15    33.3̇      56.25
 6    22.5       50     84.375
 7    31.5       70    118.125
 8      42    93.3̇      157.5
 9      54      120      202.5
10    67.5      150    253.125
11    82.5   183.3̇    309.375
12      99      220     371.25
13     117      260     438.75
14   136.5   303.3̇    511.875
15   157.5      350    590.625


This table may be used in calculating the significance of u. 
For m = 3 this expression equals 0 for all values of n. 


APPENDIX K 


Values of m²(n³ − n)/12 for different values of n and m


 n     m = 3     m = 4     m = 5     m = 6

 3       18        32        50        72
 4       45        80       125       180
 5       90       160       250       360
 6    157.5       280     437.5       630
 7      252       448       700      1008
 8      378       672      1050      1512
 9      540       960      1500      2160
10    742.5      1320    2062.5      2970
11      990      1760      2750      3960
12     1287      2288      3575      5148
13     1638      2912      4550      6552
14   2047.5      3640    5687.5      8190
15     2520      4480      7000     10080


To find W, divide S by the appropriate number in the above table.
If there are ties, divide S by the appropriate number from the table 
Uai 
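
For the case without ties, the division described above can be sketched in Python. The function names are hypothetical and the inputs illustrative; the divisor is the expression m²(n³ − n)/12 from the heading of this appendix, and the adjustment for ties is not covered here.

```python
def table_value(m, n):
    """Tabulated quantity m^2 (n^3 - n) / 12."""
    return m * m * (n ** 3 - n) / 12


def coefficient_w(s, m, n):
    """Divide S by the tabulated number (no ties assumed)."""
    return s / table_value(m, n)


print(table_value(3, 6))        # 157.5
print(coefficient_w(18, 3, 3))  # 1.0
```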



Index 

References are to pages 


Analysis of variance, 85, 110, 129 
Arbitrary origin, 9 
Arithmetic mean, 6 
Average, 1, 6 

Binomial distribution, 30 
Biserial correlation, 75 

χ², 93

Coefficient of agreement, 119
of concordance, 124 
of consistence, 117 
of correlation, 50 
of mean square contingency, 95 
of rank correlation, 69, 72 
of regression, 81 
of variation, 21 
Contingency, 94 
Correlation, biserial, 75 

diagonal summation method, 58 
fourfold, 80 
partial, 66 
partial rank, 75 
product-moment method, 50 
ranking methods, 69 
Correlation ratio, 91 
Co-variance, 51 
Curve fitting, linear, 108 
logarithmic, 111 
polynomial, 113 

Diagonal summation method, 58 
Dispersion, 1, 16 
Distribution, binomial, 30 
normal, 24 
Poisson, 105 

Fit, goodness of, 105 
Fourfold correlation, 80 
Freedom, degrees of, 2 
Frequency distribution, 1
table, 6, 13 

Gaussian distribution, 25 
Goodness of fit, 105 


Histogram, 25 

Interaction, 140 
Inter-quartile range, 16 

Kurtosis, 30 

Leptokurtic curves, 30 
Linear regression, 81 
test for, 85 

Logarithmic curves, 111

Mean, arithmetic, 6 

of correlation coefficients, 65 
Mean deviation, 17 
Mean square contingency, 96
Mean variation, 17 
Median, 11 
Mode, 12 
Moments, 28 

Normal distribution, 24 
Normality, test for, 27
Null hypothesis, 94 

Paired comparisons, 116 
Parameters, 1 
Partial correlation, 66 
Partial rank correlation, 75 
Platykurtic curves, 30 
Poisson series, 105 
Polynomials, 113 
Probability, 27, 31 
integral, 27
Probable error, 

of difference between means, 
of mean, 36 
of r, 61 

Product-moment method, 50 

Quartiles, 16 

Range, 16 

inter-quartile, 16





Ranking method, Spearman, 69
Kendall, 71

Rankings, combination of, 123
Regression, coefficient of, 81
linear, 81 
lines, 81 

test for linearity of, 86 

Sample, 1
random, 2 
Scatter, 1, 16 

Significance, meaning of, 27 
of coefficient of agreement, 123 
of coefficient of concordance, 126 
of coefficient of consistence, 
119 

of correlation coefficient, 61 
of difference between correlation 
coefficients, 63 

of difference between means, 
39 

of difference between proportions, 
46 

of single mean, 37
of partial correlation coefficient, 
67 

of rank correlation coefficients, 
70, 74 

Spot diagram, 14 
Standard deviation, 17 


Standard error,
of correlation coefficient, 61 
of difference between means, 40 
of difference between proportions, 
46 

of mean, 36 
Standard scores, 61
Statistical methods, use and abuse 
of, 6 

Statistics as estimates, 2 
Symbols, description of, 3 

t test of significance, 38, 42 
Table of x\ ^7 

Table of r-z transformation, 63
Table of significance of r, 62 
Table of t, 37 

Variables, 1 
Variance, 17 

analysis of, 85, 110, 129 
ratio, 45 

Variation, coefficient of, 21 
Working units, 10
Yates’s correction, 101 
z test, 87 

z transformation, 63 




